Free Expression Online – The Citizen Lab
https://citizenlab.ca – University of Toronto

An Analysis of Chinese Censorship Bias in LLMs
https://citizenlab.ca/2025/08/an-analysis-of-chinese-censorship-bias-in-llm/ (August 14, 2025)

In this paper, the Citizen Lab's Mohamed Ahmed and Jeffrey Knockel examine Chinese censorship bias in LLMs with a censorship detector they designed as part of the research. They warn that when LLMs are trained on state-censored texts, their output is more likely to align with the state.

An Analysis of Chinese Censorship Bias in LLMs was published in the Privacy Enhancing Technologies Symposium (PETS) 2025 proceedings.

Hidden Links: Analyzing Secret Families of VPN Apps
https://citizenlab.ca/2025/08/hidden-links-analyzing-secret-families-of-vpn-apps/ (August 14, 2025)

In this paper, co-authored by the Citizen Lab's Jeffrey Knockel, researchers investigate the secret relationships between VPN operators and the vulnerabilities these VPNs share. The authors warn that the obfuscation of these relationships prevents consumers from making informed decisions about their digital security and misleads them about the security properties of the VPNs.

Hidden Links: Analyzing Secret Families of VPN Apps was published in the Free and Open Communications on the Internet 2025 proceedings, part of the Privacy Enhancing Technologies Symposium (PETS).

Techno-Legal Internet Controls in Indonesia and Their Impact on Free Expression
https://citizenlab.ca/2025/06/techno-legal-internet-controls-in-indonesia-and-their-impact-on-free-expression/ (June 5, 2025)

Irene Poetranto examines Indonesia's use of domain name system (DNS) redirection as a method of internet censorship in a new essay published by the Carnegie Endowment for International Peace.

In Techno-Legal Internet Controls in Indonesia and Their Impact on Free Expression, Poetranto explains how DNS redirection, a new type of DNS tampering, was introduced in Indonesia following the establishment of the country's national DNS, and discusses what this means for internet users in Indonesia.

Read the essay.

 


Irene Poetranto is a senior researcher at the Citizen Lab.

This essay was published on May 28, 2025, in Digital Democracy in a Divided Global Landscape by the Carnegie Endowment for International Peace.

Banned Books: Analysis of Censorship on Amazon.com (unofficial Arabic translation)
https://citizenlab.ca/2024/11/analysis-censorship-amazon-com-ar/ (November 25, 2024)

Authors: Jeffrey Knockel, Jakub Dalek, Noura Aljizawi, Mohamed Ahmed, Levi Meletti, and Justin Lau.

Note to readers: This document is an unofficial translation and an abridged version of the full report containing selected sections only. Please note that it does not include the comprehensive analysis and detailed discussion found in the original version. As an unofficial translation, it may contain some errors or imprecise interpretations. It is intended only to provide a general understanding of our research. In the event of any discrepancy or ambiguity, the original English version of the report is authoritative and can be found here.

Key findings

  • We analyzed the system Amazon uses on its US storefront, amazon.com, to restrict shipments of certain products to specific regions. We found that Amazon restricted 17,050 products from being shipped to at least one world region.
  • While many of the shipping restrictions relate to regulations involving WiFi, car seats, and other heavily regulated product categories, books were the most common category of products that Amazon restricted in our study.
  • Banned books were largely related to topics such as LGBTIQ, the occult, erotica, Christianity, and health. The regions affected by this censorship included the United Arab Emirates, Saudi Arabia, and many other countries in the Middle East, as well as Brunei Darussalam, Papua New Guinea, Seychelles, and Zambia. In our test sample, we found that Amazon censored over 1.1% of the books offered for sale on amazon.com in at least one of these regions.
  • We identified three major censorship blocklists that Amazon uses for different regions. In many cases, the resulting censorship was either overly broad or miscategorized. Examples include the blocking of books about breast cancer, recipe books that use figurative expressions such as "food porn", Nietzsche's The Gay Science (rendered in Arabic as "العلم المرح"), and "rainbow" Mentos candy.
  • To justify why restricted products cannot be shipped, Amazon uses a variety of error messages, such as a notice that the product is currently unavailable. By misleading its customers and censoring books, Amazon violates its public commitments to the LGBTIQ community and to human rights more broadly.
  • We conclude our report with a number of recommendations to Amazon to address the concerns raised by our findings.

 

Does Amazon's censorship extend to Kindle books?

Although our study focused on testing censorship of physical products, including print editions of books, and whether they could be shipped to different regions, our preliminary testing of Kindle e-books found that Amazon also applies censorship to Kindle books and that this censorship is based on the user's geographic region as specified in the user's Kindle account (see Figure 1).

Figure 1: The option to change a Kindle user's geographic region in the Amazon account preferences.
Further research is needed to understand how Amazon's censorship of e-books compares with its censorship of print books, for example, whether it covers the same books and the same regions. We leave these questions for future research.

Questions for Amazon

We sent a letter to Amazon on October 23, 2024, containing a set of questions about the company's shipping restriction policies, and we committed to publishing their response in full. The original letter in English can be read [here].
As of the publication date of this report, November 25, 2024, we have not received any response from Amazon.

Recommendations for Amazon

In closing, we offer four recommendations to Amazon to address the concerns raised by this report:

  1. Provide transparent and accurate notices to customers when products are unavailable due to legal restrictions in the destination region. Customers should not receive misleading messages that obscure the real reasons why their products cannot be shipped.
  2. Inform users of the relevant laws that apply to these restrictions. Users who are informed of the laws restricting their purchases can make better decisions about products that the filtering system failed to restrict and can identify products that have been miscategorized.
  3. Provide a mechanism for customers to report products that have been wrongly classified as illegal in the destination region. Amazon should enable users to report wrongly restricted products so that Amazon can review them and remove the restrictions where appropriate.
  4. Review the regions to which each censorship category is applied. Regions should not be assigned to censorship blocklists arbitrarily; rather, each censorship category should be reviewed periodically to ensure that it is appropriate for each region to which it is applied, for example the censorship of LGBTIQ content in Seychelles.

Acknowledgements

We would like to thank Jedidiah Crandall, Irene Poetranto, and Adam Senft for their valuable review of the report; Alli Bruce for editing the report; Mari Zhou for help with the graphics and report design; and Snigdha Basu and Alli Bruce for their communications support. We also thank Rasha Younes, whose conversation with us about Amazon's censorship on amazon.ae and amazon.sa inspired this research. This research was conducted under the supervision of Dr. Ron Deibert.

Banned Books: Analysis of Censorship on Amazon.com
https://citizenlab.ca/2024/11/analysis-of-censorship-on-amazon-com/ (November 25, 2024)

Key findings
  • We analyze the system Amazon deploys on the US “amazon.com” storefront to restrict shipments of certain products to specific regions. We found 17,050 products that Amazon restricted from being shipped to at least one world region.
  • While many of the shipping restrictions are related to regulations involving WiFi, car seats, and other heavily regulated product categories, the most common product category restricted by Amazon in our study was books.
  • Banned books were largely related to LGBTIQ, the occult, erotica, Christianity, and health and wellness. The regions affected by this censorship were the UAE, Saudi Arabia, and many other Middle Eastern countries as well as Brunei Darussalam, Papua New Guinea, Seychelles, and Zambia. In our test sample, Amazon censored over 1.1% of the books sold on amazon.com in at least one of these regions.
  • We identified three major censorship blocklists which Amazon assigns to different regions. In numerous cases, the resulting censorship is either overly broad or miscategorized. Examples include the restriction of books relating to breast cancer, recipe books invoking “food porn” euphemisms, Nietzsche’s Gay Science, and “rainbow” Mentos candy.
  • To justify why restricted products cannot be shipped, Amazon uses varying error messages, such as conveying that an item is temporarily out of stock. In misleading its customers and censoring books, Amazon is violating its public commitments to LGBTIQ rights and to human rights more broadly.
  • We conclude our report by providing Amazon multiple recommendations to address concerns raised by our work.

Introduction

The rise in online shopping has led to more global reach into markets that may otherwise be inaccessible for companies through traditional retail channels. This increased reach brings new opportunities but also has its own challenges for global e-commerce retailers. One such challenge is in dealing with different, more restrictive regulatory environments worldwide.

In this report, we analyze American e-commerce retailer Amazon and its system for preventing shipments of certain products to certain world regions as it is implemented on the US storefront — amazon.com. Specifically, we analyze functionality that Amazon implements to restrict shipments of certain products to certain regions even if the product is available and sellers are offering to ship it there. While Amazon normally hides this restriction system from customers using misleading error messages, we employ a novel methodology to uncover and measure on which products and in which regions it is activated by peeling back the layers of Amazon’s website and analyzing its internal workings. Notably, our method can distinguish between a product being restricted by Amazon and it being organically unavailable in a region.

In total, we found 17,050 products that were restricted from being shipped to at least one world region. While many of the shipping restrictions observed in our study are related to regulations involving WiFi, car seats, and other heavily regulated product categories, the most common product category restricted by Amazon was books. Banned books were largely related to LGBTIQ, the occult, erotica, Christianity, and health and wellness. More broadly, books were the victims of censorship, which in this report we define as Amazon’s restriction of product shipment under political or religious motivation. The regions commonly affected by this censorship were the United Arab Emirates (UAE), Saudi Arabia, and many other Middle Eastern countries as well as Brunei Darussalam, Papua New Guinea, Seychelles, and Zambia.

Given that the topics censored include LGBTIQ, our findings call into question Amazon’s public commitment to LGBTIQ rights as well as its respect for the rights of its users at large. By censoring the availability of books, Amazon is depriving its users of valuable information. Furthermore, by communicating to customers that censored products are organically unavailable (e.g., being out of stock), Amazon is depriving customers of the ability to make informed decisions. We conclude our report by making multiple recommendations to Amazon.

Background

In this section we briefly describe Amazon’s history as it relates to our analysis. We then outline some of the regulations applying to Amazon’s business in Saudi Arabia, the UAE, and China, which are some of the more restrictive regulatory environments to which products on amazon.com can be shipped.

Amazon background

Amazon is an American multinational company that originated as an online bookseller and has since evolved into a global e-commerce marketplace. Amazon’s business is heavily focused on managing shipping logistics internationally and serving a global consumer base. Alongside the main e-commerce platform, they also provide cloud computing services (Amazon Web Services), consumer electronics (Amazon Kindle and Amazon Echo), and online streaming (Amazon Prime Video) among other offerings.

Amazon is best known for its original website — amazon.com — which serves as the landing page for US customers, although items can be shipped globally depending on seller preferences. As of 2024, there are dedicated storefronts for 22 other regions. Alongside the online expansions to other regions there has been an analogous expansion of physical infrastructure in those regions including shipping hubs, fulfillment centers, sorting facilities, and delivery stations.

Most relevant to our study, Amazon has expanded its dedicated storefronts to include the UAE in 2017 and Saudi Arabia in 2020. This expansion included opening a regional headquarters in Riyadh, Saudi Arabia, in 2022 and a fulfillment center in Dubai, the UAE, in 2023. These recent expansions into the Middle East create their own unique challenges to the retailer because of the region’s distinct regulatory regimes, which we detail below.

Compliance with international regulations

Amazon polices the products sold on its platform, and its shipping restrictions FAQ provides some guidance on why certain products may be restricted, including the need to “comply with all laws and regulations and with Amazon policies” and that Amazon may be “restricted from shipping to your location due to government import/export requirements, manufacturer restrictions, or warranty issues”. Amazon has adapted its policies to allow for the removal of offensive content, including content that Amazon determines is “hate speech, promotes the abuse or sexual exploitation of children, contains pornography, glorifies rape or pedophilia, advocates terrorism”, but also “other material [they] deem inappropriate or offensive”. However, Amazon has not revealed specifically what categories of content it restricts to comply with the demands of authoritarian governments.

There have been reported incidents in which Amazon complied with governments’ requests to restrict certain products or even went as far as manipulating its reviews. For example, Amazon restricted items for purchase and in search results for over 150 keywords related to LGBTIQ content in the UAE after receiving pressure from the government to remove them. In China, Amazon removed all customer ratings and reviews for a book of Chinese president Xi Jinping’s speeches and writings. In both instances, Amazon claimed that it was following local laws and regulations. In India, by contrast, internal Amazon documents showed that Amazon was circumventing local regulations by providing preferential treatment to certain sellers and by promoting its own merchandise through rigged search results. Amazon has also been criticized for allowing its platform to spread white supremacy and racism: items with Nazi symbols and Kindle books associated with neo-Nazis and white supremacists have remained widely available despite Amazon having been notified by journalists and non-profit organizations.

Regulations in Saudi Arabia

In Saudi Arabia, content is largely governed by two laws: the 2003 Law of Printing and Publication, largely regulating print media, and the 2007 Anti-Cyber Crimes Law, regulating online media. Article 9 of the Law of Printing and Publication states that printed media cannot contravene Sharia Law, stir up internal discord, injure the economic and health situation of the country, or lead to a breach of public security, public policy, or foreign interests. Article 18 states that these regulations also apply to the importation and distribution of printed materials. Within the framework of Article 18, an approval is required to certify that material is free from any content that is insulting to Islam, the government, interests of the Emirates, or ethical standards and public morality. In terms of enforcement, Article 39 states that any contravening printed items can be withdrawn from circulation if they are found to violate either Article 9 or Article 18.

The 2007 Anti-Cyber Crimes Law chiefly focuses on information security and content regulation. Article 6 of this law states that “production, preparation, transmission, or storage of material impinging on public order, religious values, public morals, or privacy, through an information network or computer” is a criminal offense. Contravening this article can lead to a maximum punishment of five years in prison and a maximum fine of three million riyals (approximately 800,000 USD). This law has been applied against online content. For example, in 2019, Saudi Arabia alerted Netflix that an episode of Hasan Minhaj’s comedy show Patriot Act violated this statute as it contained criticism of a Saudi Arabian royal. Netflix complied with the government order and restricted access to the episode for Saudi Arabian users.

Regulations in the UAE

In the UAE, content is governed by Federal Decree-Law No. 55 of 2023 on Media Regulation, which replaced the previous Federal Law No. 15 of 1980 Concerning Publications and Publishing and regulates print, television, and online media. Another relevant regulation is the Internet Access Management Regulatory Policy, which focuses on the regulation of online content. Under this policy, the only two internet service providers (ISPs) in the UAE, Etisalat and Du, are required to block online content if requested by the Telecommunications and Digital Government Regulatory Authority. Prohibited Internet content includes pornography, contempt of religion, and promotion of or trading in prohibited commodities and services. Category 13 of the policy prohibits sites from promoting or trading in commodities prohibited or restricted by licenses in the UAE, including “prints, paintings, photographs, drawings, cards, books, magazines, and stone sculptures, which are contrary to the Islamic religion or public morals, or involving intent of corruption or sedition”.

Compliance with Chinese demands

In 2004, Amazon entered the Chinese market via its acquisition of Joyo, a Chinese online bookstore. Amazon faced scrutiny for its political censorship of products on its Chinese site — amazon.cn. However, facing competition from domestic rivals, Amazon terminated its online store in China in 2019, although for a limited time overseas products were still sold on the amazon.cn site. Amazon still has other operations in China, such as Amazon Web Services (AWS), which is Amazon’s cloud computing service. Outside of China, in 2021, on the US Amazon storefront — amazon.com — Amazon partnered with China International Book Trading Corp, a state-owned firm that has been labeled as “China’s state propaganda arm”, to create a portal for selling books that amplify the Chinese Communist Party’s agenda.

Methodology

In this section, we explain how we determine product availability across different regions. Our methodology consists of two phases. As our original motivation was to understand how Amazon censorship applies to Middle Eastern countries, in our first phase, we focus on studying how products and shipment restrictions vary across multiple countries in the Middle East. We were particularly motivated to understand the differences between restrictions imposed on the shipment of products to Middle Eastern countries in which Amazon operates a storefront (namely, the UAE and Saudi Arabia) versus those in which it does not. To understand how censorship applies more broadly to the world at large, in our second phase we pivot from the results of the first phase and measure product availability in regions across the globe.

In designing our methodology, we prioritized eliminating false positives, even if doing so might introduce false negatives. The rationale is that we would rather omit some instances of censorship than falsely attribute censorship to products that are not censored.

In the remainder of this section we explain the two phases of our methodology.

Phase 1: Measuring censorship in the Middle East

One way to try to measure Amazon censorship in Middle Eastern countries would be to visit the Amazon storefronts available in the Middle East, namely the UAE’s amazon.ae or Saudi Arabia’s amazon.sa, and to try to determine which products are anomalously “missing” from these two Amazon sites. This approach, however, would be limited. For example, if we saw one book related to LGBTIQ topics that was sold on amazon.com but not amazon.ae, that might be due to the book being censored on amazon.ae, but another possibility is that the book was out of stock or not sufficiently popular to be sold in some countries. If we saw a disproportionately large number of books related to LGBTIQ topics that were available on amazon.com but not sold on amazon.ae, we would have a stronger argument, but it would be at best a statistical one: for any individual product we would not be able to prove whether it was the victim of censorship or was unavailable on that storefront for some other reason.

Given the weakness of the previously described approach, we instead measured whether products on amazon.com, the American storefront, could be shipped to various countries. As an additional benefit, this approach allowed us to study censorship in regions that did not have their own dedicated storefront. For our investigation of censorship in the Middle East, we picked four Middle Eastern countries: the UAE, Saudi Arabia, Qatar, and Yemen. We also tested a fifth country, Canada, as a control, which we explain later in our methodology.

To test which products we could ship to these five countries, we required a method for sampling a sufficiently diverse set of Amazon products. To address this requirement, we made use of the Common Crawl data set provided by the Common Crawl Foundation, a diverse, open, Internet-wide sample of Web pages scraped beginning in 2008. In April 2023, we downloaded all of the archives up to and including the February/March 2023 archive. To avoid the excessive storage requirements of keeping the entire data set, we downloaded the archives in streaming fashion, writing any Amazon product URL to a file without storing any other data from the data set. We processed the Common Crawl data from 2013 through March 2023, as March 2023 was the most recent data set available at the time that we began our testing. Although we were only interested in products available on the amazon.com storefront, since products are often available on multiple storefronts, we collected products from the 23 Amazon dedicated storefronts that were in use at this time.
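The report does not detail the filtering pipeline itself, but a minimal sketch of this kind of streaming filter, using the third-party warcio library, might look like the following (the WARC file URL and output file name are placeholders, not the actual paths used in the study):

import re
import requests
from warcio.archiveiterator import ArchiveIterator  # third-party: pip install warcio

# Placeholder for one WARC file; a real run would iterate over the paths listed
# in each crawl's warc.paths.gz index.
WARC_URL = "https://data.commoncrawl.org/path/to/one/file.warc.gz"

AMAZON_URL = re.compile(r"^https?://(?:www\.)?amazon\.[a-z.]+/", re.IGNORECASE)

def stream_amazon_urls(warc_url, out_path):
    # Stream one archive file and append any Amazon URLs to a file,
    # discarding all other data from the crawl.
    resp = requests.get(warc_url, stream=True)
    with open(out_path, "a", encoding="utf-8") as out:
        for record in ArchiveIterator(resp.raw):
            if record.rec_type != "response":
                continue
            url = record.rec_headers.get_header("WARC-Target-URI") or ""
            if AMAZON_URL.match(url):
                out.write(url + "\n")

stream_amazon_urls(WARC_URL, "amazon_urls.txt")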

Using this method, we collected a list of 114,542,719 Amazon URLs. Since not every Amazon URL is a URL to a product, we processed this URL list by searching each URL with the following regular expression:

/(?:dp|gp/product|gp/aw/d|gp/switch-language/product|product-reviews|asin|offer-listing|kindle-dbs/product)/([^/]*)(?:/|$)

This regular expression was designed to search for and detect a variety of ways that Amazon inserts Amazon Standard Identification Numbers (ASINs), Amazon’s unique product identifiers, into URLs and extract them from the URL. The result of this processing was a list of 19,074,613 unique ASINs.
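Applied in Python, this extraction step might look roughly like the following sketch (the input file name is a placeholder):

import re

# The same pattern as above: match the URL path segments that precede an ASIN
# and capture the ASIN itself in group 1.
ASIN_PATTERN = re.compile(
    r"/(?:dp|gp/product|gp/aw/d|gp/switch-language/product|product-reviews"
    r"|asin|offer-listing|kindle-dbs/product)/([^/]*)(?:/|$)"
)

unique_asins = set()
with open("amazon_urls.txt", encoding="utf-8") as urls:  # placeholder input file
    for url in urls:
        match = ASIN_PATTERN.search(url.strip())
        if match and match.group(1):
            unique_asins.add(match.group(1))

print(len(unique_asins), "unique ASINs")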

Amazon’s location selector after having selected “Saudi Arabia”.
Figure 1: Amazon’s location selector after having selected “Saudi Arabia”.

To gather information on the availability of products in our five tested countries, we sequentially test ASINs in each region using an automated program to perform the following steps. First, we load amazon.com. Then we switch our location to the region that we are testing using Amazon’s location selector (see Figure 1). For each ASIN, we navigate to https://www.amazon.com/dp/[ASIN]/ to display that product’s detail page. We then parse that page for that product’s availability status. Note that at no point in our methodology do we sign into any Amazon account. If the product’s availability in a region is any of the following, we consider the product unavailable in that region:

  • This item cannot be shipped to your selected delivery location
  • Currently unavailable
  • Temporarily out of stock
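As a minimal illustration, the parsed availability text could be checked against these messages roughly as follows (how the availability string is extracted from the page is omitted here):

# Availability messages that we treat as "unavailable" in the tested region.
UNAVAILABLE_MESSAGES = (
    "This item cannot be shipped to your selected delivery location",
    "Currently unavailable",
    "Temporarily out of stock",
)

def is_unavailable(availability_text):
    # True if the product detail page's availability string matches any of
    # the messages listed above.
    text = availability_text.strip().lower()
    return any(message.lower() in text for message in UNAVAILABLE_MESSAGES)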

While a product might be unavailable in a region due to legal or regulatory restrictions, there are also more benign reasons for a product being unavailable. Many such reasons are even alluded to in the above messages, such as sellers no longer shipping to that region or the product being out of stock among sellers shipping to that region.

As such, we are specifically interested in measuring which products cannot be shipped to a region even if there are shippers who have it in stock and are willing to ship it to that region. We call such products in that region restricted products since, even if they were in stock and there were shippers willing to ship them, Amazon would still restrict users from shipping them to that region.

Figure 2: Attempting to add restricted products to the Amazon cart in the all offers display results in “Not added” error messages.

To discern which unavailable products are restricted, we exploit a special side channel to reveal if Amazon is preventing the unavailable product from being shipped to us. Namely, we perform the following additional steps via our automated program for any product found to be unavailable. First, we browse to https://www.amazon.com/dp/[ASIN]/?aod=1. Note that, compared to the previous URL we had browsed to, this one has appended to it the “?aod=1” query string. Enabling the “aod” parameter signals to Amazon that we want Amazon to render the all offers display (AOD). This advanced display lists all offers from shippers both willing to ship to the user’s specified region and who have the product in stock. Next to each shipper’s option is a button to add that offer to one’s cart (see Figure 2). We automate our program to click all of the “Add to cart” buttons on the AOD. We measure the number of buttons whose clicks resulted in the “Added” versus “Not added” messages. If there is at least one offer and all attempts to add offers to our cart result in a “Not added” error, we consider the product potentially restricted to our configured location. We schedule another test to run a week after the original, and, if that test has the same result (i.e., that there is at least one offer and all attempts to add offers to our cart result in a “Not added” error), then we consider that product restricted to the tested region. If there are no offers (i.e., there are no buttons to click), then we are unable to discern between the product being restricted in the tested region versus being unavailable for benign reasons such as being out of stock (see Table 1 for a summary of possible results). By exploiting how AOD status messages leak whether products are restricted, we are able to learn more about Amazon’s system of restricting product shipments to certain regions.

Results from clicking “Add to cart” buttons → Interpretation
  • At least one “Added” → Product available
  • At least one offer, and all clicks result in “Not added” → Product restricted
  • No “Add to cart” buttons to click → Indeterminate

Table 1: Summary of possible results from clicking “Add to cart” buttons and their interpretations.
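A simplified Selenium sketch of this all offers display check is shown below; the CSS selectors, the status strings, and the example ASIN are placeholders rather than Amazon's actual markup, and a real implementation would also need the location-selection step, explicit waits, and retries:

from selenium import webdriver
from selenium.webdriver.common.by import By

def classify_in_aod(driver, asin):
    # Open the all offers display for an ASIN and classify the product as
    # "available", "restricted", or "indeterminate" per Table 1.
    driver.get(f"https://www.amazon.com/dp/{asin}/?aod=1")

    # Placeholder selector for the per-offer "Add to cart" buttons in the AOD.
    buttons = driver.find_elements(By.CSS_SELECTOR, "[data-aod-add-to-cart]")
    if not buttons:
        # No offers: cannot distinguish restriction from benign unavailability.
        return "indeterminate"

    added = 0
    for button in buttons:
        button.click()
        # Placeholder selector for the status message shown after the click.
        status = driver.find_element(By.CSS_SELECTOR, "[data-aod-status]").text
        if "Not added" not in status:
            added += 1

    if added:
        return "available"   # at least one offer could be added to the cart
    return "restricted"      # offers exist, but every attempt returned "Not added"

driver = webdriver.Firefox()
print(classify_in_aod(driver, "B000000000"))  # placeholder ASIN
driver.quit()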

Since many products on amazon.com do not ship internationally, we perform the following optimization to improve testing throughput. When testing unavailable products in each of the four countries for whether they are restricted, we skip testing products that are unavailable in Canada. We chose Canada as our control because of its geographical and legal similarity to the United States. This optimization reduced the number of unavailable products that we needed to test by over 85%. We were motivated to reduce this number because this part of the testing was the most time-consuming.
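In code, this optimization amounts to a simple guard before the expensive all offers display check. A small sketch follows, where unavailable_in (a mapping from region name to the set of ASINs found unavailable there) and the helper it feeds are assumptions for illustration:

def plan_restriction_tests(unavailable_in, regions=("UAE", "Saudi Arabia", "Qatar", "Yemen")):
    # Yield (asin, region) pairs that still need the AOD-based restriction test,
    # skipping any ASIN that was also unavailable in the Canadian control.
    control = unavailable_in.get("Canada", set())
    for region in regions:
        for asin in unavailable_in.get(region, set()):
            if asin in control:
                continue  # unavailable in the control region: likely does not ship internationally
            yield asin, region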

Phase 2: Measuring censorship globally

In our Phase 1 methodology, we outlined the steps for how we determine which products are restricted to the UAE, Saudi Arabia, Qatar, and Yemen. However, Amazon supports delivering to 239 countries and other regions across the globe. In Phase 2, we now expand our measurement targeting the Middle East to a global measurement by limiting the number of products that we test in each region. We do so by feeding our results from Phase 1 into Phase 2.

Specifically, we are interested in the set of products that are restricted in at least one of the four Middle Eastern countries we analyzed. From this set, we created our Phase 2 test list by choosing 1,000 of these products uniformly at random with replacement. For each product in our Phase 2 test list, we perform a similar test as we did in Phase 1, except instead of only testing in four Middle Eastern countries we test whether that product is unavailable in and restricted in all 239 regions to which Amazon supports shipping. By doing this, we hope to have a broader understanding of how Amazon censorship is applied across the globe, at least as it relates to any product censorship that we had previously measured in the Middle East.
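A one-step sketch of building this Phase 2 test list follows, assuming restricted_anywhere holds the ASINs found restricted in at least one of the four countries during Phase 1:

import random

def build_phase2_list(restricted_anywhere, k=1000, seed=0):
    # Uniform random sample with replacement, as described above.
    rng = random.Random(seed)
    return rng.choices(sorted(restricted_anywhere), k=k)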

Experimental setup

We coded an implementation of our methodology in Python using the Selenium Web browser automation framework and executed the code on an Ubuntu Linux machine. All tests were performed from a University of Toronto network. Phase 1 was performed from April 2023 to December 2023. Phase 2 was performed from May 2024 to June 2024.

Phase 1 Results

In this section we detail results from Phase 1 of our experiment.

Products tested

During our testing period, we were able to test product links collected in the Common Crawl dataset from the February/March 2023 archive working backwards until, and partially including, the September 2019 archive. Overall, we tested 5,870,695 product links during this phase of the experiment. Among these, 2,005,852 (34%) were not (or were no longer) valid product pages, resulting in Amazon “Page Not Found” errors. Recall that many of the ASINs that we acquired were from dedicated storefronts other than the United States. Therefore, although many of these links may be to products no longer sold, most of these “Page Not Found” products are likely products that were never available on the US amazon.com storefront, only on the storefronts of other countries. In addition to the aforementioned “Page Not Found” errors, an additional 19,968 product links generated Amazon “Sorry! Something went wrong!” errors. Therefore, among the 5,870,695 product links tested, we tested 3,844,875 actual products.

Internal consistency of methodology

Across our regions tested, many products only had one offer (i.e., one seller offering to ship the product), yet others had as many as 93 offers (see Figure 3 and Table 2). In our methodology, we only consider a product restricted if all of its offers result in a “Not added” status. However, as some products only have one offer, we wanted to measure the consistency of results concerning products with multiple offers to gauge the reliability of results concerning products with a single offer. Specifically, we wished to approximate how reliable testing products with a single offer is by measuring how internally consistent testing products with multiple offers is. We do so by looking at the number of products whose offers had statuses which disagreed, namely, where at least one resulted in an “Added” status and at least one resulted in a “Not added” status.

Figure 3: Histogram of the numbers of offers per region.
Number of unavailable products with…
Region | zero offers | one offer | more than one offer
Yemen | 429,571 | 19,074 | 5,198
Saudi Arabia | 134,737 | 13,316 | 11,901
UAE | 133,641 | 14,804 | 21,943
Qatar | 51,455 | 8,876 | 6,997
TOTAL | 749,404 | 56,070 | 46,039

Table 2: Summary of the numbers of offers per region.

Among our four countries of interest, we observed only 11 conflicting results: three in the UAE, one in Saudi Arabia, and seven in Yemen. We did not observe any noticeable trend in the type of products with conflicting results except that none had any clear motivation for being restricted (see Table 3 for a listing).

Table 3: Products with conflicting results among offers.

Ten out of the 11 products had at least as many “Added” results as “Not added” results. Together with there being no clear rationale for their restriction, the “Not added” results are likely false positives. Since we only consider a product restricted if all of its offers result in “Not added” messages, we correctly interpret these cases of mixed results as negative cases of restriction. There may be false positives that we were unable to detect, especially for products with only one offer available. However, given the low frequency of these false positives, namely, that among the 46,039 products tested with at least two offers only 11 showed possible false positives, we suspect an equally low rate among the products with only one offer. Specifically, if we assume the same false positive rate as what we measured, then, among the 56,070 products with only one offer, we would expect only between 13 and 14 false positives.
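This back-of-the-envelope extrapolation works out as follows (a simple proportional estimate, not a formal confidence interval):

# Measured: 11 products with conflicting offer results out of 46,039 multi-offer products.
false_positive_rate = 11 / 46_039                       # ≈ 0.00024
expected_single_offer_false_positives = false_positive_rate * 56_070
print(round(expected_single_offer_false_positives, 1))  # ≈ 13.4, i.e., between 13 and 14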

Exclusion criteria

After beginning our Phase 1 experiment we noticed that many of the products that could not be added to our cart in the all offers display also presented an additional diagnostic message to the left of the “add to cart” button, stating (see Figure 4 for an example):

“This item cannot be shipped to your selected delivery location. Please choose a different delivery location.”

Ultimately, varying by region, we found that between 16% and 47% of each region’s restricted products had at least one offer with this additional diagnostic message.

Figure 4: A product offer that cannot be added to the Amazon cart with an “item cannot be shipped” message to its left.

Judging by the nature of the products with these messages, the reason for these items’ inability to be shipped seemed unrelated to any type of religious or political censorship that Amazon might be applying. As such, in Phase 1, we exclude from analysis any offer in the all offers display that had this message; however, if some of a product’s other offers do not include this message, we do not exclude those offers.

Censorship comparison

We saw the largest number of restricted products in the UAE, followed by Saudi Arabia, then Qatar. We observed the lowest number of restricted products in Yemen (see Table 4 for details).

Region | # of known restricted products
UAE | 13,604
Saudi Arabia | 9,590
Qatar | 6,086
Yemen | 1,817
TOTAL (unique) | 17,050

Table 4: The number of known restricted products in each Phase 1 region studied.

We note, though, that this method of comparing the absolute number of known restricted products may be biased. Specifically, if some regions have more products generally available, such regions may appear to be more restricted due to having more restricted products as well. Therefore, a region with a larger number of known restricted products may not necessarily have a higher rate of restriction.

To more fairly compare Amazon’s restriction of products across these regions, we chose to generate Venn diagrams. In generating these diagrams, we only consider products for which there was at least one offer available in every region. In other words, we only consider products for which we have a clear yes or no result concerning whether it was restricted in every region featured in the diagram. We do this because we do not want our results concerning the number of restricted products to be biased toward regions in which we have more known results. Since each diagram features different regions, the totals therefore may not be consistent across diagrams due to this method of comparison.
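A small helper sketch of this filtering step follows; results_by_region (a mapping from region name to a per-ASIN status of "available", "restricted", or "indeterminate") is an assumed data structure for illustration:

def products_with_known_results(results_by_region, regions):
    # Keep only products with a definite yes/no restriction result
    # (i.e., at least one offer) in every region featured in a diagram.
    known = None
    for region in regions:
        region_known = {asin for asin, status in results_by_region[region].items()
                        if status in ("available", "restricted")}
        known = region_known if known is None else known & region_known
    return known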

Figure 5: Comparison of overlap of restricted products between the UAE and Saudi Arabia.

Between the UAE and Saudi Arabia, we found that the UAE restricted the largest number of products with around half of the products restricted in UAE also being restricted in Saudi Arabia (see Figure 5). We found that 6,360 products were restricted in common by the UAE and Saudi Arabia.

Figure 6: Left, comparison of overlap of restricted products among the UAE, Saudi Arabia, and Qatar; right, comparison of overlap of restricted products among the UAE, Saudi Arabia, and Yemen.

Comparing Qatar to the UAE and Saudi Arabia, we found that fewer products were restricted in Qatar, that almost all products restricted in Qatar were also restricted in Saudi Arabia, and that most of the products restricted in Qatar were also restricted in the UAE (see Figure 6). Comparing Yemen to the UAE and Saudi Arabia, we found that fewer products were restricted in Yemen and that almost all products restricted in Yemen were also restricted in Saudi Arabia (see Figure 6).

Figure 7: Comparison of overlap of restricted products between Qatar and Yemen.

Finally, comparing Qatar to Yemen, we found that Yemen had the fewest number of restricted products and that almost all products restricted in Yemen were also restricted in Qatar (see Figure 7).

Notably, we found different levels of restriction among these four regions despite their cultural and geographic proximity in the Middle East. Namely, the UAE featured the highest level of restricted products, followed by Saudi Arabia, then Qatar, with Yemen having the fewest. Since the UAE and Saudi Arabia feature the most restricted products, and because almost all products restricted in Qatar and Yemen were also restricted in either the UAE or Saudi Arabia, we focus the remainder of our analysis on the UAE and Saudi Arabia.

Analysis of availability messaging

In our dataset, products that are restricted by Amazon presented different and inconsistent messages to the user. The messaging around the rationale for why a certain product is restricted could inform the user as to the reason why items are unavailable. However, among all of the restricted products identified by this study, none presented a message to the user explaining that the items were unavailable due to regulatory reasons. Instead, each item was communicated as either being “currently unavailable”, “temporarily out of stock”, or that “this item cannot be shipped to your selected delivery location”. The messaging presented to users for these restricted items is vague and unclear.

Figure 8: Percentage of messages returned by Amazon for restricted items in the UAE and Saudi Arabia

The message presented to users was also inconsistent (see Figure 8). In reviewing each restricted item we could not determine why attempting to ship some restricted items resulted in one message versus another. Instead, Amazon used more generic and sometimes misleading terminology when communicating that restricted items cannot be shipped.

Analysis of censorship in Saudi Arabia and the UAE by Amazon product category

Figure 9 shows the overall products by category that are restricted in Saudi Arabia and the UAE. This is based on Amazon’s own categorization of a product, which can include both specific categories (e.g., “Book -> Science Fiction”) as well as more general categories (e.g., “Book -> Genre Fiction”). The restricted products are dominated largely by book-related categories, mainly “Genre Fiction” and “New Age & Spirituality”, which are the first and second most restricted categories, respectively, in both countries. The Japanese Manga and Fantasy book categories also appear in the top 15 restricted categories in both countries. The book-related categories most restricted in Saudi Arabia include “Thrillers & Suspense”, “Erotica”, “Christian Living”, “Anthologies”, and “Action & Adventure”. Non-book-related product categories restricted in Saudi Arabia include “Cell Phone Cases”, “Screen Protectors”, and “Groceries”. In the UAE, the most restricted book categories include “Thrillers and Suspense”, “French”, “Occult & Paranormal”, “Mystery”, and “Short Stories”. No non-book categories are represented in the top 15 restricted categories in the UAE.

Figure 9: Top 15 restricted product categories in Saudi Arabia (top) and the UAE (bottom). Note that the top categories in the UAE are entirely book-related categories. Saudi Arabia’s top categories are all but three book-related categories.

Although our Phase 1 experiment began in April 2023, ending in December 2023, it was only on June 6, 2023, that we began capturing the Amazon category for products that we did not find restricted in any region. Considering only the books that we tested since that time, we find that Amazon censored 8,965 out of the 796,081 books in that sample in at least one of Saudi Arabia, the UAE, Qatar, or Yemen. Therefore, we estimate that Amazon applies censorship to over 1.1% of the books sold on amazon.com. Since our method cannot find all instances of censorship, this estimate is only a lower bound, and the real proportion may be much larger.
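The lower-bound estimate is simply the censored fraction of that sample:

censored_books = 8_965
books_in_sample = 796_081
print(f"{censored_books / books_in_sample:.2%}")  # ≈ 1.13%, i.e., over 1.1%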

Analysis of censorship in Saudi Arabia and the UAE by censor motivation

In the previous section, our analysis of the motivation behind Amazon’s shipping restrictions was limited to looking at the categories Amazon assigns each restricted product. While this kind of analysis tells us which products are restricted, it does a poor job of explaining why they are restricted. To address this gap, we conduct a qualitative analysis, employing a more nuanced approach to decipher the underlying reasons for these restrictions.

We selected a random sample of 200 products, 100 items restricted from shipment to the UAE and another 100 items restricted from shipment to Saudi Arabia, to understand the breadth of Amazon’s shipping restrictions across various product categories. This random selection sought to minimize the bias and ensure representativeness.

We analyzed items from this random sample based on their titles and descriptions. In some cases, when the product information was not listed in the description, we conducted further background research to understand the product. We categorized items based on both the actual nature of the product and the perceived category of the product. The latter refers to a category we inferred based on potentially sensitive keywords within the product’s description or title that may be triggering Amazon’s algorithms to perceive it as falling under that category. As one example, Nietzsche’s Gay Science contains the word “gay” in the title, suggesting that it was censored for containing the word “gay”, even though the book is a philosophical work that does not speak to LGBTIQ topics. This dual categorization was designed to uncover discrepancies between a product’s apparent content and its perception by Amazon’s algorithms (see Figures 10 and 11).

Figure 10: Perceived reason for restriction vs. actual category of product in the UAE among 100 randomly chosen, restricted products.
Figure 11: Perceived reason for restriction vs. actual category of product in Saudi Arabia among 100 randomly chosen, restricted products.

Both categorizations consisted of the following categories: “LGBTIQ”, “Occult”, “Erotica”, “Overbroad”, “Christianity”, “Health & Wellness”, and “Other”. The actual category of a product that we assign is based on our analysis of the items’ titles and descriptions, whereas the perceived category is also based on the identification of certain keywords believed to trigger Amazon’s censorship algorithm and influence the miscategorization of some items.

Our review of items restricted in the UAE and Saudi Arabia highlights Amazon’s possible miscategorization of items. This is especially concerning as miscategorization ultimately has the effect of imposing unnecessary censorship rules onto users. Any case of miscategorization highlights a possibility that these decisions are being made automatically rather than with the proper care and diligence. In the following sections we highlight select items among each of our categories, including suspected cases of over-blocking.

LGBTIQ

Examples of LGBTIQ content from our random sample of restricted products include a movie featuring gay characters—and containing the word “gay” in its title—and a book on the history of the persecution of homosexuals in Nazi Germany.

Other types of LGBTIQ content from outside of this sample that we observed include workbooks relating to sexual orientation and gender identity, trans and genderqueer erotica, books on LGBTIQ criminalization in the US and queer activism, and cookbooks based on queer community-led culinary practices.

However, many products containing the word “rainbow” in their descriptions were censored despite not being otherwise related to LGBTIQ themes. Such products included rainbow-colored hair extensions, a travel case, a video game, a movie DVD, a detective novel called Mister Rainbow, and Mentos rainbow candy. These findings reveal that Amazon’s LGBTIQ censorship is overly broad, capturing products such as candy that are not illegal in any country.

As we noted previously, many books were also collaterally censored for containing the word “gay” in their title or description. As a chief example, Nietzsche’s Gay Science was censored, despite being a philosophical work unrelated to LGBTIQ topics.

Occult

In our random sample, we found books related to the occult and the paranormal including those on tarot, fairy tales, demons, jinn, witchcraft, astrology, crystals, freemasonry, astral projection, and Bigfoot. Much of the children’s books’ censorship seemed motivated by censoring the occult, although it is unclear whether these children’s books were expressly targeted or collateral damage of some larger censorship strategy. For example, in our random sample are children’s books related to jinn, witches, wizards, and necromancy. Our random sample also featured one book describing how to write fictional books relating to monsters as well as a book about Harry Potter.

Although not represented in our smaller random sample, in our larger data set we observed books relating to Thelema, an occultist movement, as well as its founder, Aleister Crowley. Many books related to extraterrestrial aliens and ufology at large were also restricted. We also observed a Dell Alienware laptop that was restricted, although we are unsure if this is due to the product’s allusion to aliens or due to electronics or communications regulations. Notably, while we observed multiple restricted products relating to oracles and divination both inside and outside of our random sample, we also found a large number of books outside of our sample related to the Oracle database software that seem to have also been caught up by the filter, suggesting that books related to oracles are over-censored.

The banning of books, even children’s books, related to the occult or magic is often religiously motivated under the belief that they are demonic or evil. For instance, Catholic schools in Canada and the United States have banned Harry Potter books. Even aliens and UFOs, while seemingly nonreligious, have been suggested by many to be related to demonic visitation, and the idea of extraplanetary visitors challenges a common religious notion of humankind being the center of creation.

Similarly, in the Middle East context, it may not be surprising to find that children’s books discussing magic and witchcraft are also censored in both Saudi Arabia and the UAE. This stance aligns with both countries’ long histories of prosecuting individuals accused of practicing witchcraft and sorcery. For instance, in Saudi Arabia, the Harry Potter books were specifically banned for containing themes perceived as occult, Satanic, depicting violence, and allegedly undermining family values.

Erotica

We identified a significant number of restricted books in our random sample whose censorship was likely motivated by restricting erotica, even though they did not contain erotica, including literature, humor, photography, travel, self-help, classic literature, among others. A notable example is Never Mind the Botox, a book that features the lives of four women working in cosmetic surgery and their daily struggles. Whereas cosmetic surgery is legal in both the UAE and Saudi Arabia, we think that certain keywords in the book description have resulted in its miscategorization under “erotica”.

Another example is Sex Addiction Survival Guide: A Practical Workbook for Reconnecting to Yourself and Others, a guide for individuals struggling with sex and pornography addiction that encourages them to move toward a healthier connection to themselves and others. It was most likely flagged as an “erotic” book based on certain words in its title and description, such as “sex”, “porn”, “sexual”, and “hypersexuality”.

Finally, Klarissa Dreams Redux: An Illuminated Anthology is a collection of poetry and other writings in the context of Klarissa Kocsis’s paintings. In the description of the book, however, the author is described as a “breast cancer survivor” who has “earned a reputation for portraiture and nudes”, either of which may be triggering Amazon’s censorship. Literature relating to breast cancer is common collateral damage of overblocking censorship filters on various platforms.

As with the occult, religious motivations are often behind the banning of erotic books. School libraries in the US have banned books on sex and sexuality, among other topics. Similarly, in the Middle East and North Africa (MENA) region, the long history of bans on erotic content and certain topics considered “taboo” has led to widespread self-censorship among publishers and translators. For instance, Article 9 of Saudi Arabia’s Law of Printed Materials and Publication states that publications are allowed when they “do not violate the provisions of Islamic Sharia”. Although the language of this law is vague, the Executive Regulations of the Publications and Publishing System are more explicit in specifying the prohibited topics, including the “spread of obscenity”.

Christianity

We identified numerous censored books related to Christianity in our random sample. However, since works related to Christianity often deal with topics such as demons or the devil, we believe that these books were collaterally censored and that Amazon’s censorship filter perceives them as being related to other topics.

For example, many censored books relating to Christianity mention demons or the devil, such as The Soul of The Apostolate. The book emphasizes that the success of Christian apostolic work hinges not merely on activity but fundamentally on a robust interior life; however, its description also promises to reveal the “Devil’s special temptations for those working for Our Lord”. Another example is the censorship of Get Thee Behind Me, Satan: Rejecting Evil, which raises questions about biblical facts and their relation to devils but is ultimately concerned with motivating the need for vigilance against the pervasive appeal of evil.

There was also one Christian book likely censored due to erotic themes. Moral Ambiguity is the fictional story of Kevin Gregory, a celebrated singer, exposing the corrupt practices and moral hypocrisy of a powerful televangelist. While the book is primarily concerned with the main protagonist revealing the hypocrisy of religious authority, the book’s description also alludes to the antagonist’s motivations of greed and sex.

Despite many Christianity-related books being restricted from shipment to the UAE, the UAE has been promoting its openness to other religions. The UAE created the Ministry of Tolerance and Coexistence, which aims to “encourage interfaith dialogue” as part of its mission. Notably, the US Department of State’s 2022 Report on International Religious Freedom concerning the UAE underscored that books on a variety of topics, including non-Islamic religions and pro-atheism, are available in the UAE. Therefore, it is unclear why Amazon’s censorship restricts shipment of so many Christian books to the UAE.

Health and wellness

Restricted products include condoms of several brands, sex toys such as vibrators, and sex education, sexual health, and gender identity books and textbooks. Lubricants are also restricted, although automotive greases with the word “lubricant” in their descriptions were also among the restricted products in our random sample.

Preventing access to sexual education materials, contraceptive methods, and other sex-related products can impact the physical and emotional wellbeing of a population, as well as the safety of its most vulnerable members. Studies on Russian sex education campaigns—or the lack thereof—show that hindering one’s contact with sexual health information leads to an increased transmission of sexually transmitted infections (STIs), a rise in sexual assault and harassment, and, among other consequences, the use of abortions as a main method of contraception. Conservatism and the insistence on protecting children’s “innocence” have proven to be the main factors driving the opposition to the implementation of sex education curricula in schools and the censorship of sexual health content in media.

In such places as the MENA region, there is a lack of preventive programs and youth-friendly services. A 2004 study of Lebanese people aged 15–24 shows that condom use was low, at only 37% among those having regular casual sex. Although population-based data remain lacking, the MENA region has been experiencing a rise in STI prevalence among its population. This is a consequence of the lack of preventive programs such as vaccination and school-based sex education.

Other

A large number of products in the “Other” category include those which are heavily regulated, such as products related to WiFi or car seats. This is especially the case in Saudi Arabia, where we also saw a large number of restricted products related to mobile phones. However, the large discrepancy between the number of products in the “Other” perceived category and the “Other” actual category points to the large amount of overly broad filtering which perceives certain products as being related to restricted categories but in actuality they are unrelated.

For a variety of products that we measured, we could not identify any reason for their restriction. As one example, we found that Malala Yousafzai’s She Persisted was censored in the UAE and Saudi Arabia. While we might speculate that such a book could be politically sensitive in those countries, it does not fit into any of the above categories of products. Moreover, other Malala Yousafzai titles were available in these countries, so it is unclear whether this is intended censorship or collateral damage from some overly broad filtering rule. There may also be motivations for product restriction beyond those that we could identify.

Other languages

Not all products within our 200-product random sample were in English. Thirteen of the restricted products were in other languages: Japanese (six products), French (five products), and German (two products). All of the restricted non-English media were written material such as books, including many comics, except for one German movie, Könige der Welt, a documentary about addiction and success in the music business. Many of the blocks in other languages fall under themes identified previously, such as the occult, erotica, or sexual health information.

Many of the restricted Japanese products are comic books, or “manga”, and many of these contain occult themes or violent content. Some restricted manga, such as My Hero Academia and Dragon Quest, is marketed to a younger audience; we suspect these titles are restricted due to an overbroad interpretation of the occult. Another suspected miscategorization is a book by Japanese author Mayumi Tanimoto called “キャリアポルノは人生の無駄だ” (Career Porn Is a Waste of Life). This book is both a humorous and earnest criticism of the work environment and labor conditions in Japan. We suspect that framing this issue in terms of “career porn” led to the product being miscategorized as pornographic, possibly due to the characters “ポルノ” (porno) appearing in its title or description.

Restricted French language content includes sexual health information such as a book about the Kama Sutra, a guidebook for performing sexual acts, and a humorous guide to dating. We suspect that two French books were captured by censorship rules targeting the occult: a young adult fantasy novel about the fictitious black magic book the “Necronomicon”, and a book about Tarot card interpretation. The one restricted German book is a children’s fantasy book about a girl named Willow, who loves nature, according to the title. We suspect that this book was also captured by occult-targeting censorship.

Phase 2 results

In this section, we detail results from Phase 2 of our experiment in which we tested across all 239 Amazon regions 1,000 products restricted in at least one of four Middle Eastern countries.

Comparison of product availability across regions

To compare how product availability varied across all 239 regions, we clustered each region according to the similarity of the products restricted in each region. Specifically, we compare individual product results using their Hamming distance h(a, b), where h(a, b) is 0 if a = b and 1 otherwise.

To compare two vectors of results, either across every region keeping the product fixed, or across every product keeping the region fixed, we use the distance metric

D(x, y) = (1/n) Σᵢ₌₁ⁿ h(xᵢ, yᵢ), where n = |x| = |y|.

Using this metric, we hierarchically clustered each region. The resulting clustered similarity matrices and dendrograms are in Figures 12 and 13. In Figure 12 both axes are regions and each cell represents two regions’ restriction similarity, but in Figure 13 the Y-axis varies over products, and we hierarchically cluster the rows of different products in the same manner as we did the columns of different regions. Each cell in Figure 13 represents whether a product is censored in a region.
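To illustrate this clustering step concretely, the following is a minimal sketch in Python, assuming a small toy results matrix; the study’s actual computation ran over 1,000 products and 239 regions, and its exact implementation may differ.

```python
# A minimal sketch of the clustering described above, using a small toy matrix;
# the real study compared 1,000 products across 239 regions.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist, squareform

# Rows are regions, columns are products; 1 = restricted, 0 = available.
# These values are purely illustrative.
results = np.array([
    [0, 1, 1, 0, 0],   # hypothetical region A
    [0, 1, 1, 0, 1],   # hypothetical region B
    [1, 1, 1, 1, 1],   # hypothetical region C
])

# Pairwise D(x, y): the fraction of products on which two regions disagree,
# i.e., the mean of the per-product Hamming distances h(x_i, y_i).
distances = pdist(results, metric="hamming")

# Hierarchical clustering with farthest-point (complete) linkage, matching the
# linkage method described in the figure captions.
tree = linkage(distances, method="complete")

# Cut the dendrogram at a chosen distance threshold to obtain flat clusters.
labels = fcluster(tree, t=0.5, criterion="distance")

print(squareform(distances))  # region-by-region distance matrix
print(labels)
```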

Figure 12: Jaccard similarity matrix of each region’s restricted products, hierarchically clustered. Dendrograms help identify seven clusters: the mauve, purple, magenta, pink, blue, green, and cyan clusters as well as the final black singleton cluster. The height of the tree reflects the distance between the two branches joined using the farthest point linkage method. Therefore, the height of the tree is short inside of each cluster and only becomes tall when joining clusters. See here for the full data set.
Figure 13: Same X-axis as Figure 12 but Y-axis is all 1,000 tested products where each cell indicates the result of the test; green: “Added”, yellow: no offers, red: “Not added”. See here for the full data set.

In Figure 13, we identified the four leftmost clusters as having limited shipping options due to varying degrees of physical or logistical remoteness. At the extreme, some of these locations are remote, unpopulated islands (e.g., Bouvet Island), which would pose obvious shipping challenges. Other locations are not physically remote but are logistically difficult to ship to due to ongoing political instability or military conflicts. As such we refer to these four as different “remote” clusters.

Moving rightward is a singleton cluster consisting of Ukraine, followed by a large 195-member cluster of regions that we consider to have baseline censorship. Within the baseline censorship cluster, there is some variation, such as which types of groceries or health supplements can be delivered or whether products from Amazon Global Store UK can be delivered. Generally, however, although some products were restricted in this cluster, we did not find any that could be categorized as religious or political censorship. Ukraine being clustered between the collection of four “remote” clusters and the baseline censorship cluster suggests that the 2022 full-scale invasion of Ukraine may have limited couriers’ access to the region, although not to the same extent as in the remote regions.

Zooming in on the 17 members in the lower right of Figure 12, we have Figure 14.

Figure 14: The lower right corner of Figure 12, zoomed in. See here for the full data set.

Beginning from the upper left, we first see China, the final member of the baseline censorship cluster. Despite Amazon censoring products in this region, we did not generally observe evidence of that censorship in our Phase 2 experiment, for reasons which we explain in the “Chinese censorship” section below. However, we note that, although China was in the baseline censorship cluster, it was the region in this cluster clustered closest to the more censored regions to its right.

Moving rightward, we see Jordan, Egypt, Bahrain, Oman, Kuwait, Qatar, Lebanon, Papua New Guinea, Maldives, Zambia, and Seychelles forming a cluster. Brunei Darussalam and Saudi Arabia are also in this cluster. Because this cluster censors LGBTIQ, occult, and other topics but does not match the extent of the UAE’s censorship, we refer to it as the moderate censorship cluster. However, while the dendrogram tree is shallow between Brunei Darussalam and Saudi Arabia, suggesting that their censorship is highly similar, the dendrogram link between this pair of countries and the remainder of the moderate censorship cluster is taller, suggesting that the pair has less in common with the rest of the cluster than they do with each other. We found that various phone accessories were restricted in Brunei Darussalam and Saudi Arabia. We are unsure whether this is the result of some regulation uniquely affecting these regions or the product of some kind of censorship which we do not presently understand.

Continuing rightward, we have the UAE in a singleton cluster. In Phase 1 we identified it as being the most censored region among those analyzed, and our Phase 2 results are similar. As such we identify it as the high censorship cluster.

Finally, we see Belarus and the Russian Federation forming a cluster. Following the 2022 full-scale invasion of Ukraine, Amazon announced suspension of shipment to these countries. As such we refer to this as the sanctioned cluster.

These results show that our clustering technique was capable both of revealing clusters of regions with similar restrictions and of uncovering those restrictions themselves. These results also shed light on our Phase 1 findings, in which the UAE and Saudi Arabia censored the most, followed by Qatar, with Yemen censoring the least. We can now identify the UAE as belonging to the high censorship group, Saudi Arabia and Qatar as belonging to the moderate censorship group (with Saudi Arabia having additional restrictions affecting, e.g., phone accessories), and Yemen as belonging to the baseline censorship group.

Product categories that are most restricted

With the regions organized into clusters, we now review each cluster’s most restricted product categories. We find that for the moderate censorship, high censorship, and sanctioned clusters the largest restricted category is “Books” (see Figure 15). The UAE has the largest percentage difference (69.47%) between “Books” and the next largest category, “Movies & TV”. Among less censored region clusters, the top category is most often (three out of six) “Grocery and Gourmet Food”, followed by “Books” (two out of six) and “Automotive” (one out of six). We see that less censored region clusters are more likely to not ship products for regulatory reasons (food and automotive), while more censored clusters are more likely to not ship media (books or movies). There are two notable exceptions among the remote region clusters, where books and music CDs are the top two categories not shipped. For these regions, this finding reflects that books make up a high proportion of the sample set we tested in Phase 2 as well as the predominant category of product sold on Amazon.

Figure 15: Top two categories by restricted product count in each censorship cluster as a percentage.

Censorship masterlists

While our analysis fleshed out various reasons for restrictions on shipping, including the remoteness of unpopulated islands or Amazon’s voluntary sanctions against Russia and Belarus, we note that when we restrict our concern to only politically and religiously motivated censorship, we observe three clusters: the baseline censorship cluster, the moderate censorship cluster, and the high censorship cluster.

We believe that these clusters are explained by the application of three different censorship masterlists which Amazon uses to simplify the process of censorship. By creating masterlists of censored products and assigning regions to each masterlist, Amazon can perform censorship more expeditiously versus applying censorship specifically tailored to each region. This “three sizes fits all” approach may help explain the over-broadness of much of the censorship that we observed.
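To make this interpretation concrete, the following sketch shows one hypothetical way such masterlists could be organized; the list names mirror our cluster labels, but the structure, region assignments, and ASINs are entirely illustrative and do not reflect Amazon’s actual implementation.

```python
# Hypothetical illustration of the masterlist interpretation described above;
# region assignments, ASINs, and structure are invented for the example.
MASTERLISTS = {
    "baseline": {"B00EXAMPLE1"},
    "moderate": {"B00EXAMPLE1", "B00EXAMPLE2"},
    "high":     {"B00EXAMPLE1", "B00EXAMPLE2", "B00EXAMPLE3"},
}

REGION_TO_MASTERLIST = {
    "YE": "baseline",   # Yemen
    "SA": "moderate",   # Saudi Arabia
    "AE": "high",       # United Arab Emirates
}

def is_restricted(asin: str, region: str) -> bool:
    """Return True if the product appears on the masterlist assigned to the region."""
    masterlist = REGION_TO_MASTERLIST.get(region, "baseline")
    return asin in MASTERLISTS[masterlist]

print(is_restricted("B00EXAMPLE3", "AE"))  # True under this toy assignment
print(is_restricted("B00EXAMPLE3", "SA"))  # False
```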

Chinese censorship

In Phase 2 of our experiment, we observed few products censored when shipping to China. Given China’s stringent regulatory requirements concerning political speech, we investigated why we saw so few products censored, contrary to our expectations.

Figure 16: Left, Tiananmen Papers censored when shipping to China; right, No Escape: The True Story of China’s Genocide of the Uyghurs censored when shipping to China.

By manually testing items that we suspected might be censored, we found that books (see Figure 16) and other products (see Figure 17) that are sensitive in China were censored when shipping to China. However, we did not observe these products censored in our Phase 2 experiment because that experiment only tested products that we had found censored in the UAE and Saudi Arabia. Since the UAE/Saudi Arabia and China have largely different censorship motivations, we would expect their censorship to have little overlap.

Figure 17: “Free Tibet” bumper sticker (left) and t-shirt (right) censored when shipping to China.

However, we believe that the methodology employed in this report would work for further exploring censorship in China if China had been studied during Phase 1. We leave such an investigation to future work.

Incompletely applied Russian sanctions

Despite Amazon’s announcement of suspension of shipments to Russia and Belarus, we found multiple products that could be shipped to these regions, including a mystery novel, a Nativity toy play set, and a tube of seafood-flavored cat toothpaste. We can identify no other commonality among these and other products which we could ship to Russia and Belarus. However, they point to the incomplete application of Amazon’s suspension of shipments to these regions.

Revisiting “item cannot be shipped” products

In Phase 1, we excluded from analysis all offers for products with an “item cannot be shipped to your selected delivery location” message displayed in the all offers display. We did this to focus on products which were affected by religious or political censorship. In Phase 2, seeking a more holistic understanding of product availability on Amazon, we did not perform such exclusions, and thus we can now speak to the cause of these messages.

In Figure 13, orange cells in the matrix represent products that would have been excluded in Phase 1 due to having all of their offers showing the “item cannot be shipped” message. There are two large concentrations of such products: a horizontal orange bar at the top of the matrix and vertical orange bars in some of the “remote” clusters. Analyzing the products in the horizontal bar, we found that they were shipped from Amazon Global Store UK. We are unsure why such products are restricted in so many regions, especially when the same products can be shipped to these regions from amazon.co.uk, the UK Amazon storefront. The vertical bars in the “remote” region clusters can be explained similarly, as products that are restricted from shipping to these regions.

There are still many unanswered questions, such as why some regions show this message while others, such as Belarus and Russia in the sanctioned cluster, do not. However, even without completely understanding the exact reasons for such distinctions, we can still use them as useful signals for clustering regions and products by various restriction motivations, including censorship or more benign motivations, as appears to be the case for the “item cannot be shipped” products.

Censorship churn

During Phase 2, we tested all regions, including the four countries that we had previously tested in Phase 1. Since we conducted Phase 2 over four months after Phase 1, Phase 2’s experiment also provided us the opportunity to measure the amount of churn, i.e., the change in results between the two experiments for the countries we measured in Phase 1. We were interested in all possible changes, including changes between being available and being restricted, but also changes to or from having no offers, the ambiguous case in which we cannot confidently conclude whether a product is available or restricted. Below we briefly summarize some of our observations.

Given that our Phase 1 data set consisted of products that were predominantly restricted in the four regions we tested, most of the changes were from being restricted to some other state. While we had hoped to find evidence that Amazon had been improving its matching criteria, the evidence for that is mixed. For instance, we had found that Nietzsche’s Gay Science was censored in both the UAE and Saudi Arabia during Phase 1. This book is seemingly collateral damage of Amazon’s censorship, since its title contains the word “gay” but its topic does not pertain to LGBTIQ issues. However, during Phase 2, the book was available in the UAE, whereas in Saudi Arabia the all offers display simply showed no offers for it, despite the book having had five (censored) offers in Phase 1. This suggests that Amazon may have an additional form of censorship in which it hides all offers rather than displaying them but not allowing them to be added to users’ carts.

We found more evidence of this phenomenon. For instance, The Joy of Sex and Witches of Pennsylvania each had 29 (censored) offers to ship to the UAE in Phase 1 but had no offers to the UAE in Phase 2. It seems unlikely that, in each case, all 29 shippers independently decided to stop shipping to the UAE. Rather, these findings suggest that many censored products on Amazon may simply show no offers at all.

Since our methodology cannot currently distinguish such cases of censorship (versus a product organically having zero offers), it is possible that Amazon’s censorship extends beyond what we have measured. In other words, our study may be underestimating the magnitude of censorship on Amazon.

Comparing to book bans across the US

Through parent-led advocacy and state legislation, a wave of books have been banned or otherwise challenged across schools and libraries in the US. These bans have overwhelmingly targeted books that discuss themes of race, gender, and sexuality, and have disproportionately censored stories written by and about people of color and LGBTIQ people. Since the most commonly restricted product category on Amazon was books, we designed an experiment to test these banned books using our previous methodology to see which are censored and whether overlap exists between the censorship of books in US schools and libraries and the censorship of books being shipped to the UAE and Saudi Arabia.

Methodology

The nonprofit organization PEN America published a list of 3,362 instances of individual book bans, affecting 1,557 unique titles in schools and libraries across the US during the 2022–2023 school year. For each unique title in the list, we performed an advanced search on amazon.com for the book’s International Standard Book Number (ISBN). If the ISBN search failed, we searched for the book’s title and author name. We then followed the first search result and, if applicable, selected the paperback edition, or the hardcover edition if there was no paperback button. We then saved the book’s title, the author’s name, the ASIN, and the URL to a file. Of the 1,557 unique titles, we were able to find 1,533 on Amazon using this method.

We implemented the above methodology programmatically using Python and the Selenium web browser automation framework. We used the Google Books API to find the ISBN of each book. We executed the code on February 20, 2024, on a MacBook running macOS.
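As a minimal sketch of the ISBN-lookup step, the following Python snippet queries the public Google Books volumes endpoint; the exact queries, parameters, and error handling used in the study may differ, and the example title and author are arbitrary inputs.

```python
# A minimal sketch of the ISBN-lookup step, assuming the public Google Books
# "volumes" endpoint; the study's exact queries and error handling may differ.
from typing import Optional

import requests

GOOGLE_BOOKS_URL = "https://www.googleapis.com/books/v1/volumes"

def lookup_isbn(title: str, author: str) -> Optional[str]:
    """Return the first ISBN-13 (or ISBN-10) found for a title/author pair."""
    query = f'intitle:"{title}" inauthor:"{author}"'
    resp = requests.get(GOOGLE_BOOKS_URL, params={"q": query}, timeout=30)
    resp.raise_for_status()
    for item in resp.json().get("items", []):
        for ident in item.get("volumeInfo", {}).get("industryIdentifiers", []):
            if ident.get("type") in ("ISBN_13", "ISBN_10"):
                return ident.get("identifier")
    return None

# Example call (hypothetical inputs; requires network access):
# print(lookup_isbn("Gender Queer", "Maia Kobabe"))
```

The subsequent amazon.com navigation (searching by ISBN, falling back to title and author, and selecting the paperback or hardcover edition) was driven through Selenium; we omit that portion here because the page selectors involved depend on amazon.com’s markup at the time of testing.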

Results

We ran each of the 1,533 ASINs through our Phase 1 methodology on February 21, 2024. We found 65 unique books (4%) were censored when attempting to ship them to Saudi Arabia or the UAE, with 54 of them being listed as “temporarily out of stock” and 11 as “currently unavailable”.

Censorship comparison

Similar to our Phase 1 results, Amazon censored more books when shipping to the UAE than to Saudi Arabia. All of Saudi Arabia’s censored books were also censored in the UAE (see Table 5). As with our Phase 1 results, we found that the messaging was vague and inconsistent. The only pattern we found was that no restricted item resulted in the “this item cannot be shipped” message, which is in keeping with our previous finding that this message, while used for restricting product shipments, is not used for Amazon’s political or religious censorship.

Region | Temporarily out of stock | Currently unavailable | This item cannot be shipped
UAE | 54 | 11 | 0
Saudi Arabia | 8 | 0 | 0
Total unique | 54 | 11 | 0

Table 5: The number of books returning each availability status across each region.

Analysis of censorship by censor motivation

To examine the motivation behind the censorship of the books, we used the same method of analysis as in Phase 1. Since books tend to feature a number of themes, we base this analysis on the perceived category that a censor may arrive at based only on the description and title of the book. We categorized each book into one of the five categories we used previously in this report: “Occult”, “Erotica”, “LGBTIQ”, “Christianity”, or, if it did not fall into the previous four, “Other”. Figure 18 shows the proportion of censored books in each category.

Figure 18: Total number of restricted books in each category.

We categorized all of the books based on keywords in their titles or descriptions that may have motivated Amazon to restrict the book’s shipment to the UAE or Saudi Arabia. Books that were categorized as occult largely contained words like “devil”, “demon”, and “magic” in their titles and descriptions. Books like Gender Queer and All Out: The No-Longer-Secret Stories of Queer Teens throughout the Ages were also likely censored for containing LGBTIQ content in their titles.

Comparison of motivations

PEN America’s analysis of the contents of banned books in US schools and libraries found that the top five categories of content that were challenged were violence, health and wellbeing, sexual experiences between characters, racialized characters and themes, and LGBTIQ characters and themes.

Since the motivations of the groups challenging books across the US differ from those behind Saudi Arabian and Emirati information controls, we would expect the perceived categories of censored books to differ as well. For instance, we found very little evidence of books being restricted for discussing race. Similarly, a number of books discussing health and wellness may have been perceived as erotica by Amazon for their discussions of sexual health.

However, we also found that censorship of books in the US had common motivations with Amazon’s censorship in the UAE and Saudi Arabia, specifically with books that contain LGBTIQ stories and themes, health and wellness content, and sexual content. In both cases, these books are often misrepresented as erotica or pornography. This argument has been used both by those attempting to remove books from schools and libraries and by these governments when censoring the media. Organizations that advocate for book banning often use this kind of hyperbolic and misleading language to argue that they are protecting children by preventing them from accessing these books.

Additionally, another high-level commonality has to do with the censor’s lack of familiarity with the content being censored. As discussed above, it is unlikely that Amazon censors each book based on its actual content. We believe that Amazon largely relies on the text of a book’s title and description to determine whether it is restricted in a region. Similarly, book challenges in US school libraries are often a result of a lack of familiarity with the content, with many who challenge books using excerpts taken out of context or using talking points provided by advocacy groups.

Lastly, just as censorship is legislated in Saudi Arabia and the UAE, advocacy pressure in the US has resulted in a series of state laws that enforce which books can and cannot be in schools and libraries. This state legislation has “supercharged” the work of groups organizing book bans, with 63% of book bans during the 2022–2023 school year taking place in the eight states that had enacted legislation to regulate access to books.

Limitations

In this section we enumerate and evaluate various limitations of the methodologies we employed in this report.

During our Phase 1 experiment, we derived our list of products to test from the Common Crawl dataset. This sampling of products may be more likely to include high traffic and longer-lived products. While this may even be a desirable property, there may also exist other biases introduced by sampling from the Common Crawl dataset that we have not anticipated.

During our Phase 2 experiment, our test set was derived from items that we found restricted in the UAE and Saudi Arabia. As regulatory regimes vary between countries, we suspect that other countries are very likely to have categories of restricted products not captured by this test set. China was one such country whose censorship this methodology did not capture. Therefore, while our analysis in this phase effectively compares censorship between regions of a common set of products, it cannot be interpreted as exhaustively enumerating the categories of products censored in every region.

In our testing, we found some categories of products whose restriction we could not explain (e.g., smartphone cases). If we had better knowledge of Amazon’s technical mechanism for filtering, we might understand such categories of products to be collateral damage of overly broad filtering rules. However, there might also exist unexplored legal or regulatory reasons for such products’ restrictions.

In our study we encountered some limitations relating to language diversity. Although our study ultimately included books in multiple languages, such as English, Japanese, and French, it notably did not encompass any Arabic books, which are critically relevant in both the UAE and Saudi Arabia. The lack of representation of such books may be due to their minuscule representation on the American Amazon site, amazon.com, or due to a sampling bias of Common Crawl.

Discussion

In this section we discuss how our results inform multiple high-level research questions concerning how Amazon performs censorship of products on amazon.com.

How does Amazon choose which products to censor at scale?

We found that Amazon assigned products and regions to different masterlists (which we named “baseline”, “moderate”, and “high”) that were used to restrict shipments of products to those masterlists’ regions. There are different automated methods by which Amazon could be assigning products to these masterlists or to the thematic categories (e.g., “LGBTIQ”, “Occult”, etc.) composing them. We found that some words commonly appear in the titles or descriptions of censored products (e.g., “gay”, “demons”, “tarot”, etc.). This finding suggests that Amazon uses a simple keyword matching approach to censoring products, for example, a rule that any product with “LGBTIQ” in its title should be censored. As further evidence, Amazon censored products which coincidentally contained these keywords, like Nietzsche’s Gay Science, suggesting that the restriction was triggered by the presence of a keyword like “gay” and that Amazon had no deeper, holistic understanding of the products it evaluated.
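As a purely illustrative sketch, the following shows the kind of naive keyword matching that could produce such false positives; the keyword set is invented, and this is not a reconstruction of Amazon’s actual system.

```python
# Purely illustrative sketch of naive keyword matching; not Amazon's actual system.
RESTRICTED_KEYWORDS = {"gay", "lgbt", "tarot", "demon"}  # assumed example keywords

def is_flagged(title: str, description: str) -> bool:
    """Flag a product if any restricted keyword appears in its title or description."""
    text = f"{title} {description}".lower()
    return any(keyword in text for keyword in RESTRICTED_KEYWORDS)

# A rule like this would flag Nietzsche's "The Gay Science" even though the book
# is unrelated to LGBTIQ topics, mirroring the false positives we observed.
print(is_flagged("The Gay Science", "A book of philosophy by Friedrich Nietzsche"))  # True
```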

However, we were unable to find any set of restricted keywords or other simple filtering rules that completely explains the content we found both available and restricted. For instance, when searching Amazon for books containing “gay” in the title, we did find a small number of matching paperback and hardcover books that Amazon allowed us to ship to both the UAE and Saudi Arabia. If Amazon performed keyword filtering, “gay” would seem a likely keyword to filter, so these counterexamples suggest either that keyword filtering is not the means through which Amazon filters products or that other variables are in play. However, recall that a small number of products could also be shipped to Belarus and Russia despite Amazon’s claim that it had ceased shipping to those countries. If Amazon intends to restrict all books with “gay” or other keywords in the title or description from being shipped to certain regions, then the same intermittent failure affecting its categorical restriction of shipments to Belarus and Russia may also be affecting its restriction of other products to other regions. Thus, even if Amazon restricts all products containing “gay” or other keywords in their titles or descriptions from certain regions, we might still expect some products to slip through its filtering.

Another possibility is that Amazon employs more sophisticated machine learning (ML) or natural language processing (NLP) methods to identify restricted products. Such filtering would also explain why a small number of products containing “gay” or “LGBT” in their title or description are not censored (e.g., as of August 14, 2024, This Book Is Gay and Not All That Glitters: An LGBT Literary Fiction Novel are available in both the UAE and Saudi Arabia). However, the few books that we found with “gay” or “LGBT” in their titles that we could ship to the UAE and Saudi Arabia featured descriptions that openly and explicitly concerned LGBTIQ topics, and an ML or NLP algorithm should therefore have restricted them regardless. These findings and our observations of false positives such as Gay Science suggest that either an ML or NLP approach is not used or that it is wholly ineffective.

In sum, while we cannot conclusively identify the exact method through which Amazon identifies products to censor, we can identify the method as being overly sensitive to the presence of certain keywords which trigger false positives.

Are these products also censored on the UAE and Saudi Arabia dedicated storefronts?

As we discussed earlier, there has been previous reporting on how Amazon censors products on the UAE Amazon site, amazon.ae. Our study does not analyze censorship on that site or on any other regional site; rather, it analyzes the American site, amazon.com, finding that users of the American site, including Americans, are subject to restrictions imposed by Amazon on where they can ship products. These restrictions are overly broad, and Amazon provides misleading explanations for why users cannot ship these products to different regions. Below we briefly analyze censorship on the UAE (amazon.ae) and Saudi Arabia (amazon.sa) dedicated storefronts and compare it to our findings from analyzing amazon.com.

Figure 19: Search results for “lgbt” from Saudi Arabian (top) and the UAE (bottom) Amazon store fronts.

We first sought to understand whether search queries were censored on the UAE and Saudi Arabia regional sites. Performing search queries on the regional UAE site, we found no results for the following LGBTIQ search terms: “gay”, “lesbian”, “transgender”, “LGBT”, “queer”, or “bisexual” (see Figure 19). Search terms related to other topics such as sexuality, the occult, or Christianity return results and are seemingly not subject to search query censorship. Thus, this censorship on the UAE storefront appears to exclusively target LGBTIQ content.

Unlike on the UAE storefront, we did not find that the Saudi Arabia dedicated storefront censored all results for certain search queries. However, on the Saudi Arabia storefront, search results for LGBTIQ topics were not relevant to our queries or, if relevant, did not contain the queried LGBTIQ search term. This finding may be explained by the Saudi Arabia storefront censoring product listings themselves rather than additionally censoring search queries.

To explore whether and how the UAE and Saudi Arabia censor product listings, we compared the search results from Amazon’s site search and Google Search. We did this as Amazon’s search results may be censored in these regions, and Google Search provides ground truth search results. Specifically, using Google we searched for site:amazon.ae intitle:”lgbt” and site:amazon.sa intitle:”lgbt”, queries designed to return all pages from the UAE amazon.ae site and the Saudi Arabia amazon.sa sites with “lgbt” in the title.

There were five results for the UAE site: three are for products which contain either “L.G.B.T.” (LGBT with each letter followed by a period) or “L G B T” (LGBT with each letter separated by spaces) in their title. Such punctuation or spacing may be enough to evade a censorship filter on amazon.ae if it is strictly looking for the string “LGBT” without punctuation or spaces. The other two results were links to Amazon search results for “lgbt bracelet”, which, unlike searches for “lgbt” by itself, returns results, suggesting that Amazon’s amazon.ae search filter is quite naive and only filters according to strings exactly equal to restricted keywords as opposed to strings containing restricted keywords. We note though that these searches for “lgbt bracelet” do not return results for products mentioning “LGBT”. These findings suggest that, in addition to censoring search queries, the UAE regional site censors the listing of products by the presence of LGBTIQ keywords in their titles or descriptions.

Similarly, there were two results for the Saudi Arabia site: one was the page of Amazon search results for the query “lgbt”. Since Amazon does not appear to filter search results themselves on amazon.sa, there are search results, but none of them related to LGBTIQ products. The other is for a product titled “Proud Mama LGBT Garden Flag,Double-Sided Flag for Lawn Yard Outdoor Decoration 12×18 Inch”, yet the product appears to be a Dwight Yoakam sleeveless shirt, unrelated to LGBTIQ topics. Since all other evidence points to the Saudi Arabia regional site also censoring the listing of products by the presence of LGBTIQ-related keywords, this mismatch between product title and product nature may explain how this “LGBT”-mentioning product remained for sale on amazon.sa, especially if Amazon’s censorship filter is applied only when a product is initially listed, or periodically, rather than on every change to the product’s title or description.

Overall, our exploratory analysis of these dedicated storefronts finds commonalities between the censorship of shipments from amazon.com to the UAE and Saudi Arabia and the censorship on amazon.ae and amazon.sa, in that LGBTIQ-themed products are censored in each case. However, more work is needed to understand how Amazon censors products on these dedicated storefronts versus products on the American storefront shipped to other regions.

Does Amazon’s censorship extend to Kindle books?

Although our study tested the censorship of physical products, including physical editions of books, and whether they can be shipped to different regions, our preliminary testing of Kindle electronic books found that Amazon also applies censorship to Kindle books and that this censorship is based on the Kindle region specified in a user’s account (see Figure 20).

Figure 20: In Amazon’s account preferences, an option to change a user’s Kindle region.

More work is required to understand how Amazon’s electronic book censorship compares to that of its physical product censorship, e.g., if it encompasses the same books and the same regions. We leave this endeavor to future work.

How does Amazon’s censorship compliance compare to other companies?

Much of our previous work has studied the way that Internet platform operators respond to the censorship demands of authoritarian governments. We have previously studied how Apple censors its product engraving service, finding that, much like Amazon’s, the censorship was overbroad and inappropriately applied to multiple regions. A follow-up study found that Apple had made improvements, eliminating problematic political censorship in Taiwan but maintaining it in Hong Kong.

Microsoft too has inappropriately applied censorship to different world regions. Although Microsoft’s Bing globally censored the iconic photos of the Tiananmen Square “Tank Man” for a period of days in June 2021, our research found that Microsoft also pervasively and globally censored Bing search suggestions for Chinese political sensitivity. This censorship included broad references to names, such as Xi Jinping, and dates, such as June 4, the day of the Tiananmen Square Massacre, and occurred across the globe, including in the United States and Canada. This censorship occurred for a period of over eight months and ended only following the publication of our findings.

We have also studied how Microsoft has complied with Chinese censorship demands on its other platforms, detailing political censorship on Skype chat, Bing search results, and Bing Translate translations. Much like Amazon’s, Microsoft’s censorship can be secret, subtle, and misleading. As a notable example, we found that Microsoft’s Bing search platform silently returns search results from Chinese state media and Chinese government websites if entering politically sensitive queries such that a user might not even know that their search results are being manipulated in this manner. Similarly, Amazon fails to inform users when they are censored and gives misleading messages about products being (e.g.) “temporarily out of stock”. Microsoft ended its censorship of Skype in China following extensive media coverage of our findings.

What are the human rights implications of Amazon’s book bans?

Under the United Nations Guiding Principles on Business and Human Rights (UNGPs), Amazon is expected to conduct rigorous human rights due diligence and has a responsibility to respect human rights. This due diligence includes identifying, preventing, mitigating, and addressing adverse impacts arising from its operations, including impacts on the right to access diverse and unrestricted information. Although not legally bound by international human rights treaties like the International Covenant on Civil and Political Rights (ICCPR), Amazon has publicly committed to upholding human rights standards through the UNGPs.

Amazon’s role as a dominant global retailer entails assessing how restrictions on book availability may limit access to important ideas and perspectives, particularly for marginalized communities such as the LGBTIQ community. This responsibility aligns with Article 19 of the Universal Declaration of Human Rights (UDHR), which guarantees the right to “seek, receive, and impart information and ideas of all kinds”. This principle has increasingly been interpreted to cover access to information provided by private entities with substantial market power, obligating Amazon to address foreseeable harms linked to its decisions on restrictions of books.

Despite Amazon’s stated support for the UNGP and public commitment to LGBTIQ rights, our research reveals troubling contradictions. We showed how Amazon censored the shipment of LGBTIQ-related books, restricting access to literature crucial for LGBTIQ people’s experiences and expression of their identities, to fourteen countries: Jordan, Egypt, Bahrain, Oman, Kuwait, Qatar, Lebanon, Papua New Guinea, Maldives, Zambia, Seychelles, Brunei Darussalam, Saudi Arabia, and the UAE. This censorship not only limits the access of LGBTIQ individuals to content that matters to them, but it also limits the broader society’s right to engage with diverse viewpoints on issues of sexual orientation and identity. Censorship of these materials fosters a limited, manipulated narrative on queerness, which poses significant harm to both marginalized communities and societal awareness of these issues.

Further, by restricting access to these books, Amazon infringes on the rights of readers, authors, and publishers. Principle 21 of the UNGP mandates that companies transparently communicate how they address potential human rights impacts, ensuring affected parties are informed of Amazon’s rationale for restrictions. However, Amazon has not provided clear, consistent explanations regarding the unavailability of specific books, issuing ambiguous or misleading messages when customers attempt to purchase or ship restricted titles. This lack of transparency fails to meet the UNGP’s requirement for open and accurate communication.

Additionally, Principle 29 of the UNGP obligates companies to provide effective mechanisms for remediation, allowing individuals adversely affected by corporate practices to seek redress. Despite this, Amazon has not provided accessible grievance mechanisms or appeals processes to allow readers, authors, or publishers to contest these arbitrary restrictions on their right to information and freedom of expression. The absence of such mechanisms leaves affected stakeholders without recourse, violating the fundamental principles of due process and access to remedy as outlined in the UNGPs.

In our study, we found that Amazon’s censorship exceeds that which is legally required or even expected. As one example, we found that Amazon applied LGBTIQ censorship to books shipped to Seychelles. While male/male sexual acts in Seychelles were once illegal, such acts were decriminalized in 2016, and the country’s legislative body has made additional strides to protect the LGBTIQ community, including the 2006 criminalization of employment discrimination on the basis of sexual orientation and the 2024 introduction of LGBTIQ hate crime legislation. However, in addition to regions inappropriately affected by censorship, our study found numerous products inappropriately captured by it, such as Nietzsche’s Gay Science, a book on breast cancer, and rainbow-colored candy.

Amazon’s approach to censoring LGBTIQ-related books reflects a significant gap between the company’s human rights commitments under the UNGPs and its operational practices. This issue is exacerbated by the absence of transparency and effective grievance mechanisms, leaving impacted people without a clear understanding of Amazon’s censorship policies and with no resources or options for remedy.

Questions for Amazon

On October 23, 2024, we sent a letter to Amazon with questions concerning Amazon’s shipment restriction policies, committing to publishing their response in full. Read the letter here.

As of November 25, 2024, Amazon has not replied.

Recommendations to Amazon

We conclude by making four recommendations to Amazon to address concerns raised by this report:

  1. Provide transparent and accurate notifications to customers when products are unavailable due to legal restrictions of the destination region. Users should not be given misleading messages that misconvey why their products cannot be shipped.
  2. Inform users of the relevant law(s) applying to the restriction. Users who are advised of laws restricting their purchase can make better-informed decisions regarding products that the filter failed to restrict and can identify products that have been incorrectly filtered.
  3. Provide customers a mechanism to flag products that have been improperly classified as being illegal in the destination region. Users should be enabled to flag products that have been erroneously restricted so that Amazon can review and, when appropriate, remove restrictions from them.
  4. Review the regions to which each category of censorship is applied. Regions should not be pigeonholed into censorship blocklists, and each censorship category should be regularly reviewed for relevance in each region to which it is applied (e.g., LGBTIQ censorship in Seychelles).

Data

We make our data available here.

Acknowledgments

We would like to thank Jedidiah Crandall, Irene Poetranto, and Adam Senft for valuable peer review; Aly Bruce for copyediting; Mari Zhou for graphics assistance and report art; and Snigdha Basu and Aly Bruce for communication support. We are also grateful to Rasha Younes whose conversation with us concerning Amazon’s search-based censorship on amazon.ae and amazon.sa inspired us to perform this research. Research for this project was supervised by Ron Deibert.

]]>
Chinese censorship following the death of Li Keqiang https://citizenlab.ca/2023/11/chinese-censorship-following-the-death-of-li-keqiang/ Tue, 21 Nov 2023 20:59:34 +0000 https://citizenlab.ca/?p=80230 Key findings
  • As part of our ongoing project monitoring changes to Chinese search censorship, we tracked changes to censorship following Li Keqiang’s death across seven Internet platforms: Baidu, Baidu Zhidao, Bilibili, Microsoft Bing, Jingdong, Sogou, and Weibo.
  • We found that the presence of some keyword combinations in search queries triggers hard censorship (when all results are censored) whereas others trigger soft censorship (when results are only allowed from whitelisted sources).
  • Motivations behind censorship were complex and seemingly paradoxical, as terms both criticizing and memorializing Li were targeted.
  • Our results demonstrate China’s ongoing efforts to push state-sanctioned narratives concerning politically sensitive topics, impacting the integrity of the online information environment.

Introduction

On October 27, 2023, Li Keqiang, the former Premier of China, passed away due to a heart attack. His death invited commentators to compare Li’s legacy to that of Xi Jinping, while in China public memorials for Li were alternately permitted and restricted. This report documents our discovery of Li Keqiang-related censorship rules on multiple Chinese platforms introduced in light of Li’s death. We found censorship rules relating to speculation over Li’s cause of death, aspirations wishing Xi had alternatively died, memorials of Li’s death, recognition of Li’s already diminished status in the party, and commentary on how Li’s death cements Xi’s political status.

Background

Li Keqiang (1955–2023) served as the Party Secretary of the provinces of Henan and later Liaoning before being appointed Vice Premier under former General Secretary Hu Jintao in 2007. Following Xi Jinping taking office as General Secretary in 2012, Li was promoted to Premier, a role he held from 2013 to 2023. With a PhD in economics from Peking University, some saw Li as a “technocrat” and a “moderate voice” within an otherwise conservative Xi administration. Over his ten years in office, Li’s power was circumscribed as Xi removed allies of Jiang Zemin and members of Hu Jintao’s Youth League faction and filled the government with loyalists. The replacement of Li Keqiang with former Shanghai Party Secretary and Xi ally Li Qiang at the 20th National Congress in 2023 signaled to some “the end of collective leadership” under Xi’s personalistic rule. Following Li’s death, obituaries published outside China referred to Li as “less influential than his immediate predecessors” and “the least powerful premier in the history of the People’s Republic of China.”

Following Li’s passing on October 27, Xi Jinping and other senior leaders attended Li’s funeral at Beijing’s Babaoshan Revolutionary Cemetery. While People’s Daily eulogized Li as “a time-tested and loyal communist soldier,” Li’s death came during a period of growing malaise within China. Xi has deepened personal control over the Communist Party of China (CPC) during his third term in office, and high youth unemployment and a declining property sector have contributed to public concern about China’s economy. Against this backdrop of tightened political control and economic uncertainty, many in China remembered Li as a pragmatic economic planner with a human touch.

In the past, the deaths of prominent figures like Li Keqiang have provided Chinese people with opportunities for protest and dissent. The death of Premier Zhou Enlai led to a million people gathering in Beijing’s Tiananmen Square in April 1976 to mark his passing and obliquely criticize Mao Zedong and the Gang of Four. In April 1989, public mourning in Tiananmen for former General Secretary Hu Yaobang grew into a larger protest movement demanding political, economic, and social reform. In February 2020, the death of COVID-19 whistleblower Dr. Li Wenliang produced an outpouring of anger online against authorities who had admonished Dr. Li for spreading “false information” about the emergence of a novel coronavirus in Wuhan.

Given the potential for public grieving to escalate into political activism, the Chinese government has attempted to manage citizens’ responses to Li Keqiang’s passing. Authorities have closely monitored spontaneous memorials in Li’s hometown of Hefei in Anhui Province and universities across China have warned students against gathering to pay respects to the former premier. Controls on public mourning have extended online. State censorship instructions cautioned media platforms against permitting “overly effusive comments” about Li’s death, a potential reference to the satirical use of “high-level black” praise to mask political criticism. The National Radio and Television Administration’s Online Media Department issued similar instructions to online media platforms to promote an “affectionate and orderly” response to former General Secretary Jiang Zemin’s death in November 2022. Despite these controls, Chinese social media users have found creative ways to memorialize Li Keqiang, including visiting the late Dr. Li Wenliang’s Weibo page to offer condolences for “another truth-teller with the surname Li.”

Methodology

In previous work, we designed an ongoing experiment to automatically test for changes in the automated censorship of search queries across seven Internet platforms operating in China: Baidu, Baidu Zhidao, Bilibili, Microsoft Bing, Jingdong, Sogou, and Weibo. To perform this testing, we automatically pull the text of recent news articles from the web, testing these texts on each platform for whether they are censored when searched for and, if so, isolating the exact keyword or combination of keywords in that text that is triggering its censorship. We call the triggering keyword or keywords the censored keyword combination. We found that the presence of some keyword combinations in search queries triggers hard censorship, i.e., the censorship of all results, whereas the presence of other keyword combinations triggers soft censorship, i.e., the censorship of results from all but whitelisted sources. For web search engines like Baidu or Bing, soft censorship restricts results to only Chinese government websites or state media, whereas for a social media site like Weibo, soft censorship restricts results to only those accounts with a sufficient level of verification. Whenever we discover a new censored keyword combination, we record it, the platform on which it was censored, the date and time of discovery, as well as whether it was hard- or soft-censored. For the full details of our methodology, please see our previous work. Our data collection began January 1, 2023, and is ongoing as of the time of this writing.
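As a simplified illustration of the isolation step, the following sketch greedily reduces a censored text to a combination of keywords that still triggers censorship; the study’s actual algorithm, described in our previous work, is more efficient, and the is_censored oracle and sample query here are hypothetical.

```python
# A simplified, illustrative sketch of isolating a censored keyword combination
# from a censored text; the study's actual algorithm is more efficient, and
# `is_censored` stands in for a real query against a platform.
from typing import Callable, List

def isolate_combination(words: List[str],
                        is_censored: Callable[[str], bool]) -> List[str]:
    """Greedily drop words that are not needed to keep the query censored."""
    assert is_censored(" ".join(words)), "the full text must already be censored"
    kept = list(words)
    for word in list(words):
        candidate = [w for w in kept if w != word]
        # Keep the word only if removing it makes the query no longer censored.
        if candidate and is_censored(" ".join(candidate)):
            kept = candidate
    return kept

# Toy oracle: pretend queries containing both "克强" and "死因" are censored.
toy_oracle = lambda query: "克强" in query and "死因" in query
print(isolate_combination(["克强", "的", "死因", "是什么"], toy_oracle))  # ['克强', '死因']
```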

In this work, we analyze keyword combinations discovered since the announcement of Li Keqiang’s death. Specifically, we look at those introduced in a period from midnight October 27 to 5pm October 31, 2023, UTC.

Findings

Following Li Keqiang’s death on October 27, we found a significant uptick in censorship surrounding Li on most platforms that we monitor. This finding is notable because Li’s name, like the names of other senior CPC leaders, was already broadly censored on most platforms before his death. For example, Baidu, Bing, and Weibo already broadly soft-censored any search query containing Li’s given name, 克强 (Keqiang), and Jingdong hard-censored and Sogou soft-censored his full name 李克强 (Li Keqiang). Therefore, new censorship rules that we discovered on these platforms were necessarily either even broader than the existing rules or targeted content that managed to avoid mentioning, depending on the platform’s pre-existing rules, either Li’s given or full name.

Below we highlight and categorize many of the new censorship rules that we discovered. While in many cases we can say confidently that the rules were added since Li’s passing, since there would be no reason for them to have been censored before, in other cases, it is also possible that we may be unearthing old rules that we had not previously discovered due to never having previously tested content that triggers them.

Cause of death

Much of the censored content concerned Li’s cause of death or implicated Xi in Li’s death. For instance, Sogou soft-censored “克强 + 死因” (Keqiang + cause of death), “總理 + 死因” (prime minister + cause of death), and “克强 + 被害” (Keqiang + harmed), which concern the cause of Li’s death and whether he was killed. Sogou’s soft censorship of “习总 + 干掉” (General Secretary Xi + get rid of) and Weibo’s hard censorship of “近平 + 暗杀” (Jinping + assassination) target discussion suggesting that Xi had Li killed, although those rules would equally censor conversation calling for Xi to be killed, and therefore we cannot exactly know the rules’ original motivation.

Wishing it were Xi instead

While much of the censorship targeted the implication of Xi in Li’s death, other censorship targeted communication wishing that it were Xi instead of Li who passed. Some censorship targeted direct wishes for Xi to die. For instance, Sogou simply soft-censored “卒习” (die Xi). Baidu conversely hard-censored “习近平 + 祈翠” (Xi Jinping + pray Xi dies). While the character “翠” literally means “jade,” its radicals when decomposed form “习习卒” (Xi Xi die) and can therefore be understood as a way to call for Xi’s death while trying to avoid censorship filters.

Other censorship rules did not target Xi by name, but nevertheless the intention of these rules is understood. For example, Weibo soft-censored “该死的没死” (the one who should die isn’t dead) as well as “好人不长命” (good people don’t live long). Many platforms also have censorship rules targeting references to “可惜不是你” (unfortunately not you), which is also the name of a popular song by Malaysian singer Fish Leong. Weibo soft-censored all references to the song, whereas Sogou only soft-censored search queries if the song’s name occurred in the presence of other, related words: “克强 + 可惜不是你” (Keqiang + unfortunately not you), “为什么敏感 + 可惜不是你” (why is it so sensitive + unfortunately not you), and “可惜不是你 + 下架” (unfortunately not you + censored). The last two are significant in that content moderators are censoring queries by users attempting to ascertain why the name of the song is censored. Following the assassination of former Japanese Prime Minister Shinzo Abe in July 2022, some social media users had also previously used the title of the song to obliquely refer to Xi.

Places and memorials

Many place names and references to in-person memorials for Li were censored in response to his death. Weibo soft-censored “曙光医院” (Shuguang Hospital) and Sogou soft-censored “克强 + 曙光医院” (Keqiang + Shuguang Hospital), referring to the hospital in Shanghai in which Li reportedly passed. It is not clear why the name of the hospital would be particularly sensitive. Content moderators may have interpreted queries about the hospital as attempts to ascertain other information about the cause of Li’s death, or authorities may have been concerned that the hospital could become a potential place for a memorial. Following the death of Jiang Zemin in November 2022, police reportedly assembled outside the hospital in which the former general secretary had been receiving care.

Other rules targeted memorializing Li. For instance, Sogou soft-censored “校园 + 聚集性悼念” (campus + collective mourning). As discussed in the “Background” section, in the past, collective mourning has provided Chinese citizens with an opportunity to criticize the state. Chinese authorities at the national and subnational level adopt different strategies in response to mass protest, including suppressing dissent and offering concessions. Chinese citizens have continued to engage in public dissent under the Xi administration, despite strong controls on collective action.

Sogou also soft-censored “真诚 + 忍让 + 善良” (sincerity + tolerance + kindness), targeting a quote from a letter Li wrote in 1982 to a graduate of Peking University: “Some people never win with force, but they move people with sincerity, tolerance, and kindness. In fact, these are the real strong people in life.” The motivation for censoring this quote could be concerns that Li’s words could be interpreted as hinting at Xi, whose conservative leadership is known for broad social controls, “strongman rule,” and an anti-corruption campaign that has doubled as a purge of his political opponents. Similarly, Weibo soft-censored the aphorism “人在做 + 天在看” (what people do + Heaven sees). Because this aphorism is commonly understood to mean that the deeds of both good and bad people will be known, content moderators may interpret the saying as indirectly praising Li and criticizing Xi.

Li’s former status

Some of the censorship highlighted Li’s already diminished status in the Party even before his death. Baidu hard-censored “弱势总理 + 习近平” (weak prime minister + Xi Jinping), a direct reference to Xi Jinping’s reduction of the authority of the office of prime minister during Li’s tenure. Sogou soft-censored queries containing “架空 + 总理” (figurehead + prime minister), another reference to Li’s restricted authority as prime minister.

Xi’s cemented status

While some censorship targeted queries concerning Li’s former status, other censorship targeted how Li’s death relates to Xi’s status as China’s paramount leader. As an example, Baidu hard-censored “习近平 + 集权于一身” (Xi Jinping + centralization of power), a reference to Xi’s personalistic rule. Some censorship made reference to Xi as an emperor. As examples, Weibo hard-censored “当今圣上” (reigning emperor) and soft-censored “圣上” (your majesty), the former term being one which prior to Li’s death had been used to refer to Xi.

More generally, Sogou soft-censored “大選 + 主席” (general election + chairman). Although we had already discovered the simplified Chinese version of this rule prior to Li’s death, we discovered the traditional Chinese variant only afterward. The general secretary of the CPC, the most senior role in the party-state, is not directly elected but is instead elected by the Central Committee. While the Chinese government has promoted “whole process democracy” as an alternative to liberal democracy, discussion of competitive elections for senior leaders is politically sensitive in China. Sogou also soft-censored “中共大方向不改 + 就没有出路” (no change to the CPC’s general direction + there will be no way out), a reference to an interview with former Central Party School professor and exiled dissident Cai Xia on the future of China.

Conclusion

As part of our ongoing project monitoring changes to Chinese search censorship across seven Internet platforms, we tracked changes to censorship following Li Keqiang’s death. Motivations behind censorship were complex and seemingly paradoxical, as terms both criticizing and memorializing Li were targeted. In China, criticism of senior leaders is prone to censorship. At the same time, out of a general motivation to prevent mass movement and because some senior leaders may be seen as potential rivals to Xi, censors restrict memorializing senior leaders, especially if doing so appears to challenge the legitimacy of Xi’s rule. Most censorship we discovered was soft censorship, indicating that the censors did not desire to block all results for search queries concerning Li but rather to direct users to state-approved content. The hard censorship we documented often targeted content that could not be redirected to state-approved alternatives, such as content calling for Xi’s death or implicating Xi in Li’s assassination. On Microsoft Bing, the only non-Chinese-operated platform featured in our study, we did not discover any notable new rules relating to Li. However, our previous work noted that Bing’s rules were the broadest and thus relied the least on large numbers of highly specific rules to capture sensitive queries. This observation may explain why we found no notable rules introduced on Bing in the aftermath of Li’s death.

Our results demonstrate China’s ongoing efforts to push CPC-sanctioned narratives concerning politically sensitive topics. Suppressing natural search results on the web and social media when searching for content concerning Li’s death presents a distorted narrative for users attempting to discover information pertaining to Li and the CPC more broadly, impacting the integrity of the online information environment.

This work builds on our greater effort to automatically track real-time censorship in response to significant political events in China, including Tibetan Buddhist events, the “709 Crackdown” on legal practitioners, the death of Nobel Prize laureate Liu Xiaobo, and the initial outbreak of COVID-19 as well as its continuing spread across the globe. Our work uses novel automated methods to determine exactly and efficiently which combination of keywords triggers the censorship of a sensitive text. Our ongoing monitoring can quickly recognize newly introduced automated Chinese censorship in response to unfolding world events.
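To illustrate the general idea, the sketch below shows one simple way a minimal triggering keyword combination could be isolated. It is not the exact algorithm used in our measurement system; `is_censored` is a hypothetical oracle standing in for a real test query against a platform, and the greedy reduction is only one of several possible strategies.

```python
# A minimal sketch (not our production measurement code) of isolating which
# combination of keywords triggers censorship of a sensitive text. It assumes a
# hypothetical oracle, is_censored(), that submits a list of keywords to a platform
# and reports whether the query is censored. The greedy loop repeatedly drops
# keywords that are not needed for censorship to trigger, leaving a minimal
# triggering combination.

def minimal_trigger(keywords, is_censored):
    """Greedily reduce `keywords` to a minimal set that still triggers censorship."""
    assert is_censored(keywords), "the full keyword set must already trigger censorship"
    trigger = list(keywords)
    for kw in list(keywords):
        candidate = [k for k in trigger if k != kw]
        if candidate and is_censored(candidate):
            trigger = candidate  # kw was not required; drop it
    return trigger

# Example with a toy oracle that censors any query containing both "A" and "B":
toy_oracle = lambda kws: "A" in kws and "B" in kws
print(minimal_trigger(["A", "B", "C", "D"], toy_oracle))  # -> ['A', 'B']
```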

Acknowledgments

We would like to thank a reviewer who wishes to remain anonymous. Research for this project was supervised by Ron Deibert.

Availability

We have made all of the data collected from our ongoing measures beginning January 1, 2023, through the end of this report’s data collection period available here.

]]>
Слепое пятно: как ВК прячет контент от российских пользователей https://citizenlab.ca/2023/07/%d1%81%d0%bb%d0%b5%d0%bf%d0%be%d0%b5-%d0%bf%d1%8f%d1%82%d0%bd%d0%be-%d0%ba%d0%b0%d0%ba-%d0%b2%d0%ba-%d0%bf%d1%80%d1%8f%d1%87%d0%b5%d1%82-%d0%ba%d0%be%d0%bd%d1%82%d0%b5%d0%bd%d1%82-%d0%be%d1%82-%d1%80/ Wed, 26 Jul 2023 13:59:35 +0000 https://citizenlab.ca/?p=79698 The following is the Russian translation of the key findings for the report titled Not OK on VK: An Analysis of In-Platform Censorship on Russia’s VKontakte. Read the full report here.
Замечание о статусе перевода: Это перевод фрагментов оригинального отчёта с английского языка сделанный одним из авторов отчёта. Тем не менее, перевод может содержать неточности. Его задача — дать базовое представление об отчёте и его выводах. В спорных ситуациях и с целью избежать неточностей просим обратиться к английскому оригиналу.

Ключевые выводы отчета

  • В данном отчете рассматривается доступность некоторых видов контента в ВК для пользователей из Канады, Украины и России.
  • Мы обнаружили, что доступ к контенту наиболее жестко ограничен для пользователей из России: в стране заблокировано 94 942 видеоролика, 1569 аккаунтов сообществ и 787 личных аккаунтов.
  • В Канаде ВК преимущественно блокировал доступ к музыкальным клипам и другому развлекательному контенту, в то время как в России мы обнаружили, что ВК блокирует контент, размещенный независимыми новостными организациями, контент, связанный с украинской и беларуской тематикой и протестами, а также ЛГБТК+ контент.
  • В Украине мы не обнаружили контента, который бы избирательно блокировался платформой ВК, хотя сама платформа в той или иной степени блокируется большинством интернет-провайдеров Украины (по указу президента Украины от 2017 года).
  • Примечательно, что за восемь месяцев, прошедших после вторжения России в Украину в феврале 2022 г., количество постановлений о блокировке, вынесенных в отношении VK, увеличилось в 30 раз.
  • В России некоторые виды видеоконтента были недоступны в ВК из-за блокировки аккаунтов отдельных пользователей или сообществ, которые их размещали. Эти видео чаще всего содержали критику Владимира Путина или российского вторжения в Украину.
  • Кроме того, мы обнаружили в России широкую блокировку ЛГБТКИ-контента по ключевым словам.
  • Мы собрали более 300 юридических обоснований, на которые ссылалась компания ВК, оправдывая блокировку видео в России.

ВВЕДЕНИЕ

Китайские социальные сети, известные повсеместной политической и религиозной цензурой внутри страны, по-разному относятся к своим китайским и некитайским пользователям. Если многие платформы, такие как Weibo, применяют политическую цензуру даже к пользователям за пределами Китая, то другие, например WeChat, стремясь привлечь некитайских пользователей, применяют к ним меньшие ограничения. Другие компании, такие как Bytedance, придерживаются подхода, при котором внутри Китая (Douyin) и за его пределами (TikTok) существуют разные платформы.

Российские соцсети также известны своей политической цензурой. Однако недостаточно изучено то какие именно механизмы цензуры они используют, какие темы подвергают цензуре и распространяются ли эти механизмы на пользователей за пределами России.

Цензура контента в российских соцсетях осуществляется с помощью “правовых механизмов”, например, судебных решений. Для блокировки сайтов в суд могут обратиться различные государственные (например, Роскомнадзор, Генеральная прокуратура) и негосударственные структуры (например, Росмолодежь), которые обычно апеллируют к одному из многочисленных российских законов, регулирующих содержание Интернета. Эти законы часто содержат расплывчатые формулировки запрещенного контента (“с нарушением установленного порядка”, “нецензурная брань”, “неуважение к… органам, осуществляющим государственную власть в Российской Федерации”, “пропаганда нетрадиционных сексуальных отношений и (или) предпочтений” и т.п.).

Эти законы используются для обоснования политической цензуры Интернет-контента, в частности контента, критикующего Путина или других российских руководителей, а также для обоснования ограничения прав ЛГБТКИ-сообщества. Кроме того, согласно закону, вступившему в силу в феврале 2021 г., платформы социальных сетей обязаны осуществлять блокировку проактивно, а не только в ответ на судебные постановления.

Предыдущие исследования, посвященные китайской цензуре в социальных сетях, показали, что передача ответственности за принятие решений о блокировке частным компаниям приводит к тому, что контент по-разному блокируется разными компаниями. При этом платформы часто “перебарщивают” с блокировкой, чтобы перестраховаться и избежать правовых последствий за недостаточно эффективную блокировку контента.

В данном исследовании нас интересует оценка того, как ВК осуществляет политическую цензуру в условиях полномасштабного вторжения в Украину, начавшегося в феврале 2022 года. Наше исследование включает в себя определение того, какие механизмы использует ВК для обеспечения цензуры, какой тип контента подвергается цензуре и распространяется ли эта цензура на пользователей за пределами России. В частности, мы измеряем доступность контента в Вк из разных стран и точек обзора, чтобы выявить случаи дифференцированной цензуры, т.е. случаи когда контент подвергается цензуре в одном регионе, но не в другом.

Это позволяет, например, определить, какой контент виден в Канаде, но не виден в России, и наоборот. В данном отчете мы сосредоточимся на сравнении доступности контента из России, Украины и Канады.

Отчёт построен следующим образом: в разделе “Методология” мы подробно описываем наши методы выявления различий в цензуре ВК в разных странах, а в разделе “Экспериментальная установка” описаны условия, в которых мы использовали эти методы. Далее, в разделе “Результаты”, мы раскрываем полученные нами данные о повсеместной политической и социальной цензуре, которую ВК применяет к пользователям в России. В разделе “Ограничения” мы указываем на ограничения нашего исследования, и, наконец, в разделе “Обсуждение” мы обсудим, как наши результаты способствуют лучшему пониманию цензуры в Интернете в России и как российская цензура в социальных сетях сопоставляется с цензурой в других странах.

Методология

В данном разделе подробно описывается методика измерения дифференцированной цензуры в ВК в Канаде, Украине и России. Мы проводили исследование без регистрации и взаимодействия с какими-либо учетными записями пользователей на платформе ВК. Вместо этого мы проверяли доступ удалённо используя сетевые точки обзора (vantage points) в регионах, которые мы выбрали для сравнения. Это позволило нам сравнить доступность контента в ВК в каждом регионе.

Этот метод позволяет нам проводить тестирование без использования SIM-карт и телефонных номеров, снизить риски для исследователей и избежать этических проблем, связанных с созданием или передачей контента через платформу.

Для проверки дифференцированной цензуры в ВК, а также для обеспечения разнообразной выборки популярных, легко перечисляемых тем для запросов, мы сделали предварительную выборку из заголовков статей Википедии. Для начала мы выбрали следующие шесть языковых редакций Википедии: Русская, Украинская, Белорусская, Грузинская, Чеченская и Казахская. Мы выбрали эти языковые редакции потому, что ВК является российской платформой, а эти языки широко распространены как в России, так и в прилегающих к ней регионах.

Для каждого из этих языковых изданий Википедии мы отсортировали статьи по общему количеству просмотров за январь, февраль и март 2023 года, расположив их в порядке убывания популярности. При тестировании мы отбирали статьи из этих шести отсортированных списков по принципу круговой выборки, так что тестировалось одинаковое количество статей из каждого языкового издания Википедии.
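Для иллюстрации: примерно так может выглядеть описанная круговая выборка. Это набросок, а не наш рабочий код; списки и заголовки в примере условные.

```python
# Примерный набросок круговой выборки заголовков: поочерёдно берём по одной статье
# из каждого отсортированного по популярности списка, пока все списки не исчерпаны.
from itertools import zip_longest

def round_robin(*sorted_title_lists):
    """Поочерёдно выдаёт заголовки из каждого списка (пропуская исчерпанные)."""
    for group in zip_longest(*sorted_title_lists):
        for title in group:
            if title is not None:
                yield title

# Пример: три условных языковых списка, отсортированных по числу просмотров
ru = ["Статья РУ 1", "Статья РУ 2"]
uk = ["Стаття УК 1"]
be = ["Артыкул БЕ 1", "Артыкул БЕ 2", "Артыкул БЕ 3"]
print(list(round_robin(ru, uk, be)))
```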

В данной работе нас интересует сравнение доступности контента ВК в Канаде (либеральной демократической стране, где находится Citizen Lab), а также в России и Украине, странах, занимающих первое и третье места по числу посетителей ВК соответственно. ВК позволяет искать видео, сообщества и людей. Таким образом, для каждого ключевого слова которое мы тестировали, мы одновременно выполняли девять поисковых запросов на сайте ВК. Для каждой из стран – Канады, Украины и России – и для каждой из целей поиска – видео, сообществ и людей мы задавали запрос на название статьи по этой цели поиска в данной стране. Для каждой из этих девяти комбинаций мы фиксировали количество результатов поиска по данному запросу.

Поскольку мы заметили, что в некоторых случаях ВК ошибочно выдает меньшее количество результатов, чем обычно, мы применили следующую процедуру повторного тестирования. Через 24 часа после выполнения запроса мы повторили поиск по этому запросу с той же точки обзора. После этого фиксировалось наибольшее из двух значений.

Далее мы попытались определить некий порог, при котором можно было бы считать подозрительно разными результаты поиска в двух разных регионах. Мы заметили, что при сравнении абсолютных чисел подозрительно различаются запросы с большим количеством результатов, а при сравнении по процентной разнице подозрительно различаются запросы с малым количеством результатов. Поэтому для проверки того, что количество результатов поиска X из одного региона подозрительно меньше, чем количество результатов Y из другого региона, мы используем следующую статистическую эвристику.

Анализируя первые результаты между Канадой и Украиной, мы сразу заметили, что они совпадают с точностью до случайных колебаний. В ранней выборке из 132 тестов восемь имели разное количество результатов, в каждом случае с разницей в единицу. Поскольку мы не знали переменных, влияющих на эти случайные колебания (т.е. пропорционален ли размер колебаний количеству результатов?), мы выбрали 8/132 в качестве верхней границы доли результатов, которые могут отсутствовать случайно. Используя односторонний критерий хи-квадрат, мы провели проверку разности пропорций, а именно гипотезы о том, что доля пропавших результатов (Y – X) / Y ≤ 8/132. Если мы отвергаем эту гипотезу (т.е. вероятность того, что такая разница в пропорциях возникла случайно, составляет p < 0,001), то делаем вывод, что результатов X подозрительно мало по сравнению с Y.
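Ниже приведён минимальный набросок этой эвристики на Python с использованием SciPy. Это не точная реализация из нашего исследования: одностороннее p-значение здесь получено делением двустороннего пополам, что является упрощением, а пороги и имена функций приведены только для иллюстрации.

```python
# Минимальный набросок описанной эвристики «подозрительно малого» числа результатов.
# Предположение: доля пропавших результатов (Y - X) / Y сравнивается с базовой долей
# случайных расхождений 8/132, наблюдавшейся между Канадой и Украиной, с помощью
# одностороннего критерия хи-квадрат по таблице сопряжённости 2x2.
from scipy.stats import chi2_contingency

BASELINE_MISSING, BASELINE_TOTAL = 8, 132  # случайные расхождения в ранней выборке
ALPHA = 0.001

def suspiciously_fewer(x: int, y: int) -> bool:
    """True, если X результатов (один регион) подозрительно меньше Y (другой регион)."""
    if x >= y:
        return False  # ничего не «пропало»
    missing = y - x
    table = [[missing, x],
             [BASELINE_MISSING, BASELINE_TOTAL - BASELINE_MISSING]]
    _, p_two_sided, _, _ = chi2_contingency(table)
    # Односторонний тест: срабатываем, только когда доля пропавших выше базовой.
    return (missing / y) > (BASELINE_MISSING / BASELINE_TOTAL) and (p_two_sided / 2) < ALPHA

# Пример: 500 результатов в Канаде, но только 350 в России
print(suspiciously_fewer(350, 500))
```

В реальном конвейере измерений способ расчёта p-значения и порог могут отличаться.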

Для запросов, количество результатов которых оказалось подозрительно малым, мы дополнительно выясняли, какой именно контент отсутствует. Поскольку ВК позволяет просмотреть не более 999 результатов поиска по одному запросу, мы ограничили это сравнение запросами, дающими менее 1000 результатов. Если X < 1 000 и Y < 1 000 и при этом в одном из регионов результатов подозрительно мало по сравнению с другим, мы загружали все результаты поиска для обоих регионов и фиксировали, какие результаты отсутствуют в каждом из них.
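Сама процедура сравнения сводится к разности множеств идентификаторов; ниже условный набросок (идентификаторы вымышленные, это не наш рабочий код).

```python
# Условный набросок сравнения полностью выгруженных результатов двух регионов.
# Предполагается, что списки идентификаторов уже собраны; ВК отдаёт не более
# 999 результатов на запрос, поэтому при большем количестве сравнение не выполняется.

def diff_results(ids_region_a, ids_region_b):
    """Возвращает (отсутствующие в A, отсутствующие в B) или None, если сравнение невозможно."""
    if len(ids_region_a) >= 1000 or len(ids_region_b) >= 1000:
        return None  # один из запросов упёрся в лимит выдачи
    return ids_region_b - ids_region_a, ids_region_a - ids_region_b

russia = {"video-1_10"}
canada = {"video-1_10", "video-2_20", "video-3_30"}
# отсутствующие в России: {'video-2_20', 'video-3_30'}; отсутствующие в Канаде: пустое множество
print(diff_results(russia, canada))
```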

Чтобы лучше понять, почему то или иное видео, сообщество или пользователь отсутствует в результатах по данной стране, мы пытались открыть каждый отсутствующий результат из того региона, в котором он отсутствовал: как через десктопную (vk.com), так и через мобильную (m.vk.com) версии сайта VK. Мы фиксировали все сообщения об ошибках и другие заглушки, которые выдавались пользователю.

Детали эксперимента

Описанная выше методика была реализована на языке Python с использованием модулей aiohttp и SciPy и выполнялась на машине с операционной системой Ubuntu 22.04. Эксперимент проводился с 17 апреля по 13 мая 2023 года. Канадские измерения проводились из сети Университета Торонто. Российские и украинские измерения проводились через туннели WireGuard, предоставленные популярным VPN-сервисом, предлагающим российские и украинские точки обзора. Учитывая, что ВК в той или иной степени заблокирован в большинстве украинских сетей, перед проведением экспериментов мы подтвердили, что наша украинская точка доступа имеет доступ к ВК (анализ блокировки ВК в Украине см. в Приложении А).
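Для иллюстрации: примерно так могла бы выглядеть параллельная отправка поисковых запросов через aiohttp. Адрес и параметры запроса здесь условные, а маршрутизация через канадскую, украинскую и российскую точки обзора в реальном эксперименте выполнялась на уровне системы (WireGuard), а не в коде.

```python
# Иллюстративный набросок параллельных запросов к поиску ВК через aiohttp.
# URL и параметры условные; это не описание реального API ВК.
import asyncio
import aiohttp

SEARCH_URL = "https://vk.com/search"  # условный адрес страницы поиска

async def fetch_search_page(session, query, section):
    # section: "video", "communities" или "people" (условные значения параметра)
    async with session.get(SEARCH_URL, params={"q": query, "c[section]": section}) as resp:
        return await resp.text()

async def run(query):
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(
            *(fetch_search_page(session, query, s) for s in ("video", "communities", "people"))
        )

# asyncio.run(run("Генеральна Асамблея ООН"))
```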

Результаты

Мы составили список из 127 187 самых популярных статей в каждой из Википедий на русском, украинском, белорусском, грузинском, чеченском и казахском языках. В совокупности эта выборка составила 708 346 уникальных названий статей. Измерив, какой контент был заблокирован в поисковых запросах ВК по этим названиям, мы обнаружили дифференцированную блокировку видео в Канаде, а также видео, сообществ и личных аккаунтов в России, хотя мотивы блокировки видео в Канаде и России оказались совершенно разными, как мы объясним ниже. Мы также обнаружили, что в России результаты поисковых запросов по сообществам и людям подвергались цензуре по ключевым словам, в то время как в Канаде такой фильтрации не было.

Примечательно, что мы не обнаружили внутриплатформенной дифференцированной блокировки, осуществляемой ВК в Украине по сравнению с Канадой или в Украине по сравнению с Россией. Поскольку мы не обнаружили контента, который был бы доступен в Канаде или России, но был бы недоступен в Украине, в оставшейся части данного раздела мы сосредоточимся на сравнении дифференциальной блокировки в России и Канаде и наоборот.

Сначала мы даем обзор наших результатов и подробно описываем различные механизмы блокировки, используемые VK. Затем мы используем методы анализа данных, чтобы лучше понять обнаруженный нами заблокированный контент, например, какой тип контента был заблокирован, какие события привели к его блокировке или какие юридические обоснования приводил VK при блокировке.

Обзор и механизмы блокировки

Из полученных результатов мы сделали вывод, что ВК использует несколько методов блокировки. Основным методом было блокирование или удаление определенных результатов поиска. Среди контента, отсутствующего в Канаде, мы не обнаружили удаленных личных аккаунтов. Девять сообществ отсутствовали в канадских результатах поиска, однако сами страницы этих сообществ были доступны из Канады при переходе по их URL-адресам, и в них не было ничего, что подсказывало бы, почему они могли быть удалены из результатов поиска. Поэтому мы считаем, что это были ложные срабатывания: такие результаты могут отсутствовать по вполне естественным причинам, например из-за того, что разные балансировщики нагрузки или кэширующие серверы имеют несовпадающее представление об одних и тех же данных.

Если не считать небольшого количества ложных срабатываний, то в Канаде оказались недоступными 2613 видео. Они отсутствовали в результатах поиска и выдавали либо сообщение “Это видео недоступно в вашей стране”, либо сообщение “Звук видео недоступен”. Как оказалось, все эти видео были популярными спортивными, музыкальными и другими развлекательными видеороликами, размещенными обычными пользователями, и, учитывая объяснения, приведенные в сообщениях о блокировке, эти видеоролики, скорее всего, были заблокированы в Канаде за нарушение авторских прав. Однако в России эти видеоматериалы были по-прежнему доступны. Наша гипотеза заключается в том, что такое разное отношение к контенту, нарушающему авторские права, объясняется в целом непоследовательным применением ВК правил об авторском праве в разных регионах.

Видео – Украина: не найдено; Канада: заблокированы видео, нарушающие авторские права; Россия: видео блокируется, если заблокировано разместившее его сообщество или пользователь
Сообщества – Украина: не найдено; Канада: не найдено; Россия: (1) блокировка поиска по ключевым словам, связанным с ЛГБТКИ; (2) блокировка отдельных сообществ по политическим темам
Люди – Украина: не найдено; Канада: не найдено; Россия: (1) блокировка поиска по ключевым словам, связанным с ЛГБТКИ; (2) блокировка отдельных пользователей по политическим темам
Таблица 1: Для каждого региона и вида контента указаны обнаруженные нами типы блокировки
При поиске сообществ и людей мы заметили, что ВК отключает результаты поиска, если поисковый запрос содержит определенные ключевые слова, связанные с ЛГБТКИ (см. рисунок 1 и таблицу 2 с перечнем ключевых слов, которые, как мы обнаружили, вызывают фильтрацию). Если при поиске сообществ и людей цензура по ключевым словам применялась, то при поиске видеороликов она, по-видимому, не применялась.
Изображение 1: Поиск по термину “ЛГБТ” в России блокировал все результаты, содержавшие ключевое слово “ЛГБТ”

 

Ключевое слово Английский перевод
gay gay
LGBT LGBT
Геи gay
Гей gay
ЛГБТ LGBT
ЛГБТК LGBTQ
Лесбиянка lesbian
Трансгендер transgender
Фембой femboy
Таблица 2: Ключевые слова по которым блокировался поиск по сообществам и пользователям в России
Помимо фильтрации поиска сообществ и личных аккаунтов по ключевым словам, VK также напрямую блокирует отдельные сообщества и личные аккаунты, скрывая их из результатов поиска и выводя сообщение о блокировке при просмотре страницы аккаунта. Фактически блокировка сообществ и личных аккаунтов является основным методом цензуры видео в России. Кроме 134 видеороликов, которые не показывали сообщения о блокировке и которые мы считаем ложными срабатываниями, остальные 94 942 видеоролика, пропавшие из результатов поиска, показывали сообщение о блокировке в десктопной версии ВК, например “Это видео недоступно, поскольку его создатель был заблокирован“. Мы смогли доказать, что все эти видео были заблокированы по причине блокировки сообщества или человека, разместившего видео, поскольку при попытке просмотреть сообщество или аккаунт, разместивший эти видео, мы получили сообщение о блокировке, в котором упоминалось решение суда о блокировке.
Изображение 2: Пример заблокированного видео на десктопной версии ВК

Для обоснования блокировки сообществ и личных аккаунтов в России мы нашли 336 уникальных сообщений с указанием 303 различных номеров судебных дел. Пример такого сообщения: “Этот материал заблокирован на территории РФ на основании решения суда/уполномоченного федерального органа исполнительной власти (Центральный районный суд г. Хабаровска – Хабаровский край) от 10.08.2015 № 2-5951/2015”. В тех случаях, когда информация находится в открытом доступе, эти судебные дела, судя по всему, представляют собой запросы на удаление сайтов, поданные российскими прокурорами или другими субъектами, которые в качестве обоснования апеллируют к различным российским законам. Например, в деле, приведенном в вышеупомянутом сообщении о блокировке, российский прокурор апеллирует к статье 4 российского закона “О средствах массовой информации”, чтобы попросить суд вынести решение об удалении контента в ВК, в котором якобы используются нецензурные выражения в адрес Владимира Путина.

Изображение 3: Пример заблокированной страницы сообщества

Если в десктопной версии ВК при блокировке сообщества или человека, разместившего видео, постоянно отображалось сообщение “Это видео недоступно, так как его создатель заблокирован”, то в мобильной версии, если видео было размещено заблокированным в России сообществом (в отличие от личного аккаунта), вместо него отображалось сообщение о блокировке заблокированного сообщества. Непонятно, почему возникает такое несоответствие.

В редких случаях мы наблюдали и другие сообщения об ошибках, не являющиеся сообщениями о блокировке: “Пожалуйста, войдите для просмотра этого видео”, “Доступ к этому видео был ограничен его создателем” и “Эта страница либо удалена, либо еще не создана”. Эти сообщения не свидетельствовали о блокировке, а, скорее, имели место, если контент удалялся или ограничивался по усмотрению создателя в процессе тестирования. Поэтому мы не считали контент с такими сообщениями об ошибках заблокированным. В таблице 3 приведена классификация типов сообщений об ошибках, которые мы наблюдали.

Вид сообщения об ошибке Примеры В какой стране наблюдалось? Сообщение о блокировке?
Блокировка по решению суда “Этот материал заблокирован на территории РФ на основании решения суда/уполномоченного федерального органа исполнительной власти (Центральный районный суд г. Хабаровска – Хабаровский край) от 10.08.2015 № 2-5951/2015” Россия Да
Сообщество или пользователь, опубликовавший видео, заблокированы “Видео недоступно, поскольку его создатель заблокирован” Россия Да
Нарушение авторского права “Это видео недоступно в вашей стране”; “Звуковая дорожка к видео недоступна” Канада Да
Нет доступа к видео “Пожалуйста, залогиньтесь, чтобы посмотреть видео”; “Доступ к видео был ограничен его создателем” Канада, Россия Нет
Удалено “Эта страница удалена или еще не создана” Канада, Россия Нет
Таблица 3: Категоризация сообщений об ошибке, примеры сообщений, страна, в которой они наблюдались, и наличие или отсутствие сообщения о блокировке
На первом этапе мы находили заблокированные сообщества и личные аккаунты по их отсутствию в результатах поиска. Далее мы стали искать их, отталкиваясь от того, кто опубликовал заблокированные видеоролики (набросок такого “обратного хода” приведён после таблицы 4). Работая в обратном направлении от заблокированных видеороликов к заблокированным сообществам или личным аккаунтам, которые их разместили, мы обнаружили еще 826 заблокированных в России сообществ и 768 личных аккаунтов. В дополнение к изначально найденным 804 заблокированным сообществам и 19 заблокированным личным аккаунтам, непосредственно отсутствующим в результатах поиска сообществ и людей, мы обнаружили 1569 уникальных сообществ и 787 уникальных личных аккаунтов, заблокированных в России (см. таблицу 4, где представлена сводка всего заблокированного контента).
Видео – Канада: 2,613 отсутствуют в поиске (всего 2,613); Россия: 94,942 отсутствуют в поиске (всего 94,942)
Сообщества – Канада: 0; Россия: 804 отсутствуют в поиске, 826 обнаружены по заблокированным видео (всего 1,569 уникальных)
Люди – Канада: 0; Россия: 19 отсутствуют в поиске, 768 обнаружены по заблокированным видео (всего 787 уникальных)
Таблица 4: Для каждой страны и каждого вида контента указано общее количество заблокированного контента.
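Судя по URL-адресам видео в таблицах ниже, идентификатор владельца закодирован в самом адресе видео (отрицательный идентификатор соответствует сообществу, положительный – личному аккаунту). Это наше предположение о формате адресов, а не описание API ВК; примерный набросок такого “обратного хода”:

```python
# Набросок «обратного хода» от заблокированных видео к разместившим их аккаунтам.
# Предположение (видно по URL в таблицах ниже): адрес видео имеет вид
# vk.com/video{owner}_{id}, где отрицательный owner соответствует сообществу,
# а положительный – личному аккаунту.
import re
from collections import Counter

VIDEO_RE = re.compile(r"video(-?\d+)_\d+")

def owners(video_urls):
    """Возвращает счётчики сообществ и личных аккаунтов по списку URL видео."""
    communities, personal = Counter(), Counter()
    for url in video_urls:
        m = VIDEO_RE.search(url)
        if not m:
            continue
        owner = int(m.group(1))
        (communities if owner < 0 else personal)[owner] += 1
    return communities, personal

urls = ["https://vk.com/video-36069860_166138550",   # сообщество (owner < 0)
        "https://vk.com/video576554975_456272785"]   # личный аккаунт (owner > 0)
print(owners(urls))
```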
В этой части отчёта мы дали краткий обзор методов блокировки в ВК, а также объема контента, подверженного каждому типу блокировки в разных регионах. В оставшейся части этого раздела мы проведем более глубокий анализ типов контента, блокируемого в ВК. Сначала мы охарактеризуем заблокированные в Канаде и России видео по поисковым запросам, из результатов которых они пропали, и по пользователям, опубликовавшим их, а также качественно опишем случайную выборку содержимого самих заблокированных видео. Затем мы проанализируем сообщения о блокировке и юридические обоснования, которые ВК выдает пользователю при попытке просмотра заблокированного контента.

Анализ заблокированных видео

В этом разделе мы даём качественное описание заблокированных видеороликов по поисковым запросам, по пользователям, опубликовавшим заблокированные видеоролики, а также по случайной выборке содержания самих заблокированных видеороликов.

Какие поисковые запросы выдавали заблокированные видео?

Напомним, что в ходе тестирования мы брали названия популярных статей из Википедии на разных языках и использовали их для поиска видео, сообществ и личных аккаунтов на сайте ВК, чтобы выяснить, блокируются ли эти результаты поиска в конкретном регионе и в какой степени. В данном разделе нас особенно интересуют поисковые запросы, которые привели к обнаружению большого количества заблокированных видео, поскольку такие запросы могут свидетельствовать о типе контента, заблокированного в ВК. Мы называем такие запросы продуктивными.

Видео заблокированные в России

Среди десяти наиболее продуктивных запросов (т.е. тех, по которым было найдено наибольшее количество заблокированных видеороликов) большинство связано с войной в Украине (“Учасники російсько-української війни” [Участники российско-украинской войны], “Пропаганда війни в Росії” [Пропаганда войны в России]) и с международными структурами, занимающимися посредничеством в урегулировании конфликта (“Генеральна Асамблея ООН” [Генеральная Ассамблея ООН], “Міжнародний суд ООН” [Международный суд ООН]).

Продуктивными в России оказались и запросы, имеющие лишь косвенное отношение к войне. Так, запрос “Тайное вторжение”, взятый из статьи Википедии о серии комиксов Marvel, тем не менее выдавал заблокированный контент, связанный с российским “вторжением” в более широком смысле, а “Катэгорыя 24 лютага” [Категория: 24 февраля] – это название статьи Википедии о праздниках, отмечаемых 24 февраля, то есть в день начала полномасштабного вторжения России в Украину. Есть также продуктивные запросы, связанные с Украиной в целом, например “Поліський район” [Полесский район], бывший административный район Киевской области, и гимн родного города президента Украины Владимира Зеленского – Кривого Рога (“Кривий Ріг моє місто”). Также мы обнаружили один термин, связанный с новостной службой в Беларуси (“БелаПАН”), и один термин, который, казалось бы, не имеет отношения к конфликту (“Чорні троянди” [“Черные розы”]), но при ближайшем рассмотрении оказалось, что это название заблокированного в России сообщества украинской военизированной группировки.

Место в рейтинге Запрос Перевод Описание страницы Язык запроса количество заблокированных видео Всего результатов
1 Чорні троянди Чёрные розы Турецкий вид роз Украинский 493 904
2 БелаПАН БелаПАН Частное новостное агенство в Беларуси Беларуский 476 843
3 Кривий Ріг моє місто Кривой Рог – Мой город Гимн города Кривой Рог Украинский 450 493
4 Пропаганда війни в Росії Пропаганда войны в России Описание пропаганды войны в России Украинский 449 625
5 Генеральна Асамблея ООН Генеральная Ассамблея ООН Генеральная Ассамблея ООН Украинский 424 798
6 Міжнародний суд ООН Международный суд ООН Международный суд ООН Украинский 419 807
7 Учасники російсько-української війни Участники российско-украинской войны Описание конфликта Украинский 346 453
8 Сакрэтнае ўварванне Тайное вторжение Мини сериал Марвел Беларуский 322 358
9 Поліський район Полесский район Бывший регион в Киевской области Украинский 314 356
10 Катэгорыя 24 лютага Категория: 24 февраля Список праздников 24 февраля Беларуский 293 363
Таблица 5: 10 самых продуктивных запросов по которым мы нашли наибольшее количество заблокированного в России контента

Видео заблокированные в Канаде

В отличие от России, наиболее продуктивные запросы в Канаде были связаны не с войной в Украине, а со спортом, музыкой и географическими объектами. Большинство запросов (шесть из десяти) связаны со спортом, в том числе: Кубок Дэвиса (на русском и беларуском языках), чемпионат мира по фигурному катанию и три разных футболиста (Чиро Иммобиле, Алехандро Гомес и Дуван Сапата). Также встречаются запросы, связанные с музыкой (K Ci & JoJo и Beatles Bootleg Recordings) и географическими местами (Locust и Charleroi). Запросы, которые привели к блокировке контента в Канаде, отличаются от российских и в большей степени ориентированы на развлечения, а не на текущие события.

Место в рейтинге Запрос Перевод Описание страницы Язык # заблокированных видео Всего
1 Локаст Locust Город в США Чеченский 161 284
2 Кубок Дэвиса Davis Cup Чемпионат по теннису Кубок Дэвиса Русский 123 457
3 Иммобиле Чиро Ciro Immobile Итальянский футболист Русский 78 194
4 Шарлеруа Charleroi Город в Бельгии Украинский 59 285
5 K Ci JoJo K Ci & JoJo Музыканты Грузинский 57 251
6 Чемпионат мира по фигурному катанию 2023 World Figure Skating Championships 2023 Событие связанное с фигурным катанием Русский 55 157
7 Кубак Дэвіса Davis Cup Кубок Дэвиса Беларуский 54 456
8 The Beatles Bootleg Recordings 1963 The Beatles Bootleg Recordings 1963 Компиляция музыки Beatles Грузинский 53 253
9 Алехандро Гомес Alejandro Gomez Аргентинский футболист Украинский 52 258
10 Сапата Дуван Duván Zapata Колумбийский футболист Русский 52 77
Таблица 6: Десять самых продуктивных запросов в Канаде по которым мы получили наибольшее количество заблокированных видео.

Языки заблокированных видео

В предыдущем разделе мы рассмотрели поисковые запросы, которые привели к обнаружению большого количества заблокированных видеороликов. В этом разделе мы проведем аналогичный анализ, но с учетом того, на каком языке Википедии был сделан поисковый запрос. Наша цель – выяснить, заголовки статей в каком языковом издании Википедии привели к наибольшему количеству заблокированных видео. Это делается для того, чтобы лучше понять языки видеоконтента, заблокированного в VK.

Видеоролики, заблокированные в России

Среди видео, заблокированных в России, наибольшая доля (61%) приходится на запросы из украинской Википедии, за ней следует беларуская (36%), а на третьем месте – российская (1%). На все остальные языки (казахский, чеченский и грузинский) пришлось менее 0,3%. Непропорционально большое количество заблокированного украинского контента удивительно: после блокировки ВК в Украине в 2017 году доля украинских интернет-пользователей, заходящих в ВК в течение дня, снизилась с 54% до всего 10%. Более того, несмотря на то что ВК является российской социальной медиаплатформой, русскоязычные запросы привели к обнаружению лишь небольшой доли заблокированных в России видеороликов (1,33%). Однако эти результаты могут свидетельствовать лишь об эффективности цензуры в ВК, не позволяющей российским и, следовательно, в основном русскоязычным пользователям высказываться на подцензурные темы. Кроме того, социальная цена блокировки для россиян выше, чем для тех, кто находится за пределами России, что еще больше сдерживает чувствительные политические высказывания пользователей в России.

Изображение 4: Количество заблокированных видео в России по языкам статей Википедии

 

Язык # количество запросов Доля
Украинский 148,313 61.56%
Беларуский 87,521 36.33%
Русский 3,205 1.33%
Казахский 854 0.35%
Чеченский 760 0.32%
Грузинский 264 0.11%
Таблица 7: Количество заблокированных видео в России по языкам статей Википедии

Видео заблокированные в Канаде

В отличие от России, где блокировалась большая доля видеороликов, запрашиваемых по названию статьи из украинской Википедии, языковой состав запросов, по которым обнаруживались заблокированные видеоролики в Канаде, заметно отличается. Среди видео, заблокированных в Канаде, русский язык представлен в нашем наборе данных наиболее широко – 43,44% результатов, за ним следуют казахский (20,47%) и грузинский (13,89%), а доля остальных языков (украинского, чеченского и белорусского) не превышает десяти процентов. В Канаде русский язык представлен гораздо шире (43,44% в Канаде по сравнению с 1,33% в России). Этот результат соответствует ожиданиям, поскольку ВК – российская социальная медиаплатформа с преимущественно русскоязычной пользовательской базой, и, следовательно, на такой платформе должно быть больше русскоязычного контента, нуждающегося в модерации, чем на любом другом языке.

Эти результаты отражают различия в мотивах блокировки видео в России и Канаде. В России VK блокирует контент, содержащий в основном определенные политические взгляды, которые часто высказывают украинские и белорусские пользователи. В то же время в Канаде VK блокирует контент, содержащий нарушения авторских прав, которые, как мы предполагаем, с одинаковой частотой совершаются носителями разных языков. Поскольку VK является российской платформой, мы ожидаем, что абсолютное число россиян, подвергшихся модерации, будет выше из-за их большей представленности на платформе.

Изображение 5: Количество заблокированных видео в Канаде по языкам статей Википедии

 

Язык количество запросов Доля
Русский 1,426 43.44%
Казахский 672 20.47%
Грузинский 456 13.89%
Украинский 315 9.59%
Чеченский 237 7.22%
Беларуский 177 5.39%
Таблица 8: Количество заблокированных видео в Канаде по языкам статей Википедии

Кто опубликовал заблокированные видео?

Далее мы рассмотрим, кто разместил наибольшую долю заблокированного контента, обнаруженного нами в VK, чтобы получить представление о физических и юридических лицах, чей контент чаще всего подвергается блокировке VK, и о том, что они размещают. Для каждой страны мы рассматриваем два типа источников: видео, размещенные личными аккаунтами, и видео, размещенные сообществами. Следует отметить, что на страницах некоторых сообществ может присутствовать бренд компании, но не всегда понятно, являются ли эти аккаунты официальными. VK предлагает систему верификации для компаний и брендов, но верификация является необязательной, и некоторые компании могут не знать или не хотеть проходить эту процедуру. В нашем обсуждении мы будем указывать, прошла ли компания верификацию.

Видео заблокированные в России опубликованные индивидуальными пользователями

Изучив заблокированные в России видеоролики, мы обнаружили 1429 персональных аккаунтов, заблокированных в стране. Среди них наибольшая доля (37%) заблокированных видеороликов приходится на пользователя “Олег Скрипник”, за ним следуют “Дарина Иванив” (12%) и “Подрыв Устоев” (4%). На долю этих трех лидеров приходится 53% всех видеороликов, размещенных, как мы выяснили, с заблокированных личных аккаунтов, что подчеркивает непропорционально большой вклад небольшого числа пользователей в общее число заблокированных видео. Среди заблокированных личных аккаунтов большинство размещает политический контент лишь изредка, и их нельзя назвать аккаунтами, используемыми в основном для активизма. Несколько заблокированных личных аккаунтов, по всей видимости, принадлежат украинским военным и по-прежнему активны. Этот вывод показывает, что, несмотря на широкую критику VK как небезопасной и пророссийской платформы и несмотря на ее блокировку в Украине (см. Приложение А), она по-прежнему используется многими украинцами, в том числе и теми, кто сейчас находится на фронте.

Rank Profile URL Account Name Content Posted # of Videos Discovered Blocked Share
1 https://vk.com/skripoleg Oleg Skripnik Ukraine war content. 19,061 37.93%
2 https://vk.com/id576554975 Daryna Ivaniv Ukraine war content. 6,328 12.59%
3 https://vk.com/s.krupko63 Podryv Ustoev Ukraine war content. 2,131 4.24%
4 https://vk.com/id613313976 Daryna Ivaniv Ukraine war content. 1,228 2.44%
5 https://vk.com/id229910131 Masha Vedernikova Ukraine war content. 1,193 2.37%
6 https://vk.com/id303073458 Boris Suslenskiy Ukraine war content. 1,005 2.00%
7 https://vk.com/id157885457 Lyubov Platonova Ukraine war content. 770 1.53%
8 https://vk.com/id293387897 Vasily Zhazhakin Ukraine war content. 690 1.37%
9 https://vk.com/id129054771 Igor Zachosa Ukraine war content. 604 1.20%
10 https://vk.com/id22401146 Sergey Derkach Ukraine war content. 568 1.13%
Таблица 9: 10 личных аккаунтов с которых опубликовано наибольшее количество заблокированных в России видео.
Помимо 1429 заблокированных личных аккаунтов, найденных по заблокированным видеороликам, при прямом поиске по различным заголовкам статей в категории “Люди” мы обнаружили еще 19 заблокированных личных аккаунтов по их отсутствию в результатах поисковых запросов. Все эти дополнительные аккаунты, за исключением одного под названием “Femboy Developer”, связаны с “Правым сектором” – украинской националистической группировкой (см. таблицу 10).
Profile URL Title
https://vk.com/id315585161 Praviy-Sektor Zakarpattya
https://vk.com/id287586663 Praviy-Sektor Shishaki-Ray-Org
https://vk.com/id241957654 Pravy Sektor
https://vk.com/id253532397 Praviy-Sektor Peremishlyani
https://vk.com/id303491180 Pravy Sektor
https://vk.com/id257667002 Praviy-Sektor Praviy-Sektor
https://vk.com/id459902176 Pravy Sektor
https://vk.com/id247366231 Praviy-Sektor Chechelnik
https://vk.com/ukrop24 Praviy-Sektor Dikanka-Rayorg
https://vk.com/drogobych_ps Drogobich Praviy-Sektor
https://vk.com/id244694134 Pravy Sektor
https://vk.com/id289687245 Praviy-Sektor Kolomia
https://vk.com/id248075744 Pravy Sektor
https://vk.com/id297537442 Pravy Sektor
https://vk.com/id284720470 Praviy-Sektor Karlivka
https://vk.com/pszak Pravy Sektor
https://vk.com/id406055235 Praviy-Sektor Kolomia
https://vk.com/id366480496 Pravy Sektor
https://vk.com/femboy_dev Femboy Developer
Таблица 10: Аккаунты ещё 19 заблокированных пользователей которых мы обнаружили благодаря их отсутствию в выдаче по поисковым запросам.

Видео заблокированные в России опубликованные сообществами

Мы обнаружили 826 сообществ, заблокированных VK в России, по размещенным ими видеороликам, которые также были заблокированы в России. Десять из этих заблокированных сообществ перечислены в таблице 11 в порядке количества обнаруженных нами заблокированных видеороликов, размещенных ими. Среди этих сообществ – новостные (“Площадь”), украинские и беларуские патриотические (“Моя страна Беларусь”, “Моя Украина”, “Патриоты Украины”). Одно из сообществ посвящено работе регионального украинского телеканала “Первый канал – Первый городской”. Есть также сообщества оппозиционных СМИ, ориентированных на Беларусь, таких как “Белсат ТВ”, “Радио Свобода” и “Европейское радио для Беларуси”. Среди них “Белсат ТВ” и “Радио Свобода” финансируются Польшей и США соответственно, а “Европейское радио для Беларуси” является независимым. Из всех этих аккаунтов только аккаунт “Европейского радио для Беларуси” является “верифицированным” в VK, хотя все эти группы размещают контент со своих страниц в сообществах.

Rank Profile URL # of Videos Discovered Blocked Share Content Posted Account Type
1 https://vk.com/ploshcha 23,265 18.00% Belarus content. News Poster
2 https://vk.com/euroradio 18,667 14.44% Verified account of European Radio for Belarus, nonprofit media for Belarus Media
3 https://vk.com/belsat_tv 10,738 8.31% Belsat TV, Polish state funded media for Belarus Media
4 https://vk.com/majabelarus 8,337 6.45% Belarus content. Patriotic Community
5 https://vk.com/radiosvaboda 7,354 5.69% Radio Svoboda Belarus, US state media for Belarus Media
6 https://vk.com/patrioty 4,320 3.34% Ukraine war content. Patriotic Community
7 https://vk.com/we.patriots 3,681 2.85% Ukraine war content. Patriotic Community
8 https://vk.com/1tv_kr_ua 2,564 1.98% Ukrainian Regional Television Media
9 https://vk.com/ua.insider 2,562 1.98% Ukraine war content. Nationalist
10 https://vk.com/war_for_independence 2,265 1.75% Ukraine war content. Patriotic Community
Таблица 11: 10 сообществ опубликовавших наибольшее количество заблокированных в России видео
Помимо этих десяти сообществ заблокированы и другие сообщества, в том числе украинские СМИ, такие как “Громадське” и BBC News Ukraine, а также беларуская оппозиционная газета “Наша Ніва”. Заблокировано также верифицированное сообщество команды Алексея Навального. Среди результатов мы также обнаружили сообщества, связанные со спортом, такие как “ФК Шахтер” (страница болельщиков донецкого футбольного клуба “Шахтер”) и By.Tribuna.com (белорусское отделение международного спортивного СМИ Tribuna).

Помимо 826 заблокированных сообществ, которые мы нашли по заблокированным видеороликам, при прямом поиске по различным заголовкам статей в категории “Сообщества” мы обнаружили еще 804 заблокированных сообщества, поскольку они отсутствовали в результатах поисковых запросов. В таблице 12 мы приводим десять запросов, которые привели к обнаружению наиболее заблокированных сообществ в России.

Rank Language Query Translation Types of Communities # of Communities Discovered Blocked Share
1 Russian Неодимовый магнит Neodymium magnet Sale of magnets to tamper with gas and water meters. 72 8.94%
2 Russian Европейская хартия местного самоуправления European Charter of Local Self-Government Pro-USSR regionalist groups 54 6.71%
3 Kazakh Сыпатай Саурықұлы Sypatai Saurykuly Sports wagering communities (query unrelated) 50 6.21%
4 Russian Фиктивный брак Fictitious marriage Communities to arrange fake marriages 42 5.22%
5 Belarusian Пуцін хуйло Putin is a dick Anti-Putin groups 38 4.72%
6 Ukrainian Кирило Лукаріс Kyrylo Loukaris Pill buying/selling (query unrelated) 38 4.72%
7 Ukrainian Національний корпус National Corps Nationalist communities 36 4.47%
8 Russian Партия националистического движения Nationalist Movement Party Nationalist communities 31 3.85%
9 Georgian Путин хуйло Putin is a dick Anti-Putin groups 27 3.35%
10 Chechen СагӀсена Sarcenas Ozempic sales (query unrelated) 28 3.48%
Таблица 12: 10 поисковых запросов которые дали нам наибольшее количество заблокированных в России сообществ.
Запрос, который привел к обнаружению наибольшего количества заблокированных сообществ, связан с продажей неодимовых магнитов (“Неодимовый магнит”); на него приходится более 8% обнаруженных нами заблокированных сообществ. Судя по содержанию страниц этих сообществ, речь идет о редкоземельных магнитах, которые рекламируются как средство для вмешательства в работу счетчиков воды и газа. В описании одной из групп утверждается, что использование этих магнитов в подобных целях запрещено законом, что говорит о непоследовательном правоприменении в отношении таких сообществ. Многие другие поисковые запросы также связаны с потенциальным мошенничеством, например, с организацией фиктивных браков (“Фиктивный брак”), спортивными ставками, продажей таблеток и диетических добавок. Заблокированы сообщества расистских и националистических группировок. Есть и сообщества, связанные с просоветскими регионалистскими группами (например, сообщество КНВР Удмуртской области [Община КНВР Удмуртского Региона]). Наконец, многие запросы и заблокированные группы содержат критику власти и оскорбления в адрес Путина: значительная часть из них озаглавлена антипутинским лозунгом “Пуцін хуйло”, что переводится как “Путин – козел”.

По всей видимости, заблокированные сообщества имеют другую направленность по сравнению с заблокированными видео. Если в России заблокированные видеоматериалы в основном связаны с войной в Украине и в Беларуси, то заблокированные сообщества посвящены потенциальному мошенничеству. Однако есть и некоторые пересечения: расистский и националистический контент блокируется как в видео, так и в сообществах на территории России.

Видео заблокированные в Канаде опубликованные отдельными пользователями

В отличие от России, в Канаде в первой десятке личных аккаунтов, разместивших наибольшее количество заблокированных видео, все, кроме одного, размещали преимущественно музыкальный контент (см. таблицу 13). В канадской десятке лидеров не было ни одного аккаунта, размещавшего политический контент или материалы о текущих событиях. Этот результат опять же отличается от российского. Следовательно, в Канаде блокировка в ВК направлена в основном на развлекательный контент, что, скорее всего, связано с авторскими правами.

Rank Profile URL Account Name Content Posted # of Videos Discovered Blocked Share
1 https://vk.com/ig.linevich Igor Linevich Music 182 25.63%
2 https://vk.com/id474426680 Vadim Popov Music 79 11.13%
3 https://vk.com/chertoritsky Sergey Chertoritsky Music 30 4.23%
4 https://vk.com/walema Stary Ded TV 21 2.96%
5 https://vk.com/step1972 Andrey Krivopishin Music 13 1.83%
6 https://vk.com/blogthe The Blog Music 7 0.99%
7 https://vk.com/sergeylzar Sergey Lazarikhin Music 7 0.99%
8 https://vk.com/s.pantsyrny Slava Pantsyrny Music 6 0.85%
9 https://vk.com/id3788507 Alexander Kukhtin Music 5 0.70%
10 https://vk.com/id243891102 Lasha Ujmachuridze Music 4 0.56%
Таблица 13: 10 индивидуальных аккаунтов опубликовавших наибольшее количество видео заблокированных в Канаде

Видео заблокированные в Канаде опубликованные сообществами

Тенденция к блокированию развлекательного контента в Канаде сохраняется и для сообществ, разместивших заблокированные там видеоматериалы. Шесть из десяти таких сообществ посвящены спорту, три – музыке и одно – мультфильмам. Заметное место занимают каналы российских медиапроизводителей, включая телевидение (“Телеспорт”, “Окко Спорт”, “Матч Премьер”) и радио (ОМСК 103,9 FM). Этим контент отличается от сообществ, чьи видео заблокированы в России: среди них также есть СМИ, но сфокусированные в основном на политике и текущих событиях (“Белсат”, “Радио Свобода”, “Еврорадио”).

Rank Content Poster # of Videos Discovered Blocked Share Content Posted Account
1 https://vk.com/telesport 533 27.57% Sports Russian sports television “Tele Sport”
2 https://vk.com/serieavk 313 16.19% Sports Community for Italian Soccer League “Serie A”
3 https://vk.com/silatv 206 10.66% Sports Russian sports television “Tele Sport”
4 https://vk.com/locasta 161 8.33% Music “Locasta” street dancing clips
5 https://vk.com/okkotennis 119 6.16% Sports Russian TV “Okko Sport” tennis community
6 https://vk.com/okkosport 103 5.33% Sports Russian TV sports station “Okko Sport”
7 https://vk.com/sibiromsk 39 2.02% Music Russian radio station OMSK 103.9 FM
8 https://vk.com/2pac_one_nation 30 1.55% Music Fan community for musician Tupac Shakur
9 https://vk.com/matchpremier 29 1.50% Sports Russian sports television station “Match Premier”
10 https://vk.com/public207473513 26 1.35% Cartoons Community for “Davv Productions”
Таблица 14: Десять сообществ опубликоваваших наибольшее количество видео заблокированных в Канаде.

Каково содержание заблокированных видео?

В связи с большим количеством обнаруженных нами заблокированных видеороликов просмотр и классификация всего контента были бы нецелесообразны. Поэтому для выявления общей тематики заблокированного контента мы случайным образом отобрали 30 видеороликов, заблокированных в России, и 30 видеороликов, заблокированных в Канаде, просмотрели их и распределили по категориям в соответствии с их содержанием.

Видео заблокированные в России

Среди 30 отобранных заблокированных в России видеороликов наибольшую долю (43%) составляют ролики, связанные с войной в Украине. Среди просмотренных видеоматериалов – кадры военных действий, демонстрация боеприпасов, интервью с военнослужащими, ток-шоу с обсуждением военных действий. Следующая по величине категория заблокированного контента – видеоматериалы, связанные с Беларусью (26%), включающие видеозаписи акций протеста, а также новостные сообщения о погибших, задержанных и трагедиях. Третья наиболее часто встречающаяся категория – контент об Украине, не связанный с войной (13%), включающий освещение экономических проблем и маршей националистов.

Изображение 6: Категории заблокированных в России видео из нашей случайной выборки

 

Missing in Russia Category Notes
https://vk.com/video-36069860_166138550 Belarus Protest around death of Belarussian in pretrial detention
https://vk.com/video-36069860_456240026 Belarus Debate between a Belarusian opposition leader Dashkevich and undercover police
https://vk.com/video155142793_456347462 Belarus Moving Iskander-2 missiles to Belarus
https://vk.com/video-36069860_456252718 Belarus Radio Svoboda coverage of detained Belarussian photographer
https://vk.com/video-36069860_456246093 Belarus TV coverage of the 1999 stampede tragedy in Minsk.
https://vk.com/video-22639447_456254462 Belarus Death of Belarusian scientist Boris Kit
https://vk.com/video-22639447_456264287 Belarus Message from Minsk Workers to Lukashenko’s Trade Union
https://vk.com/video-72572911_456243223 Criminal Ukrainian anti-corruption TV program
https://vk.com/video613313976_456244390 History Educational audio program describing Ukrainian writer and poet Borys Antonenko-Davydovych
https://vk.com/video-23282997_159220433 History Educational video about judging in Middle ages Lithuania and Ukraine
https://vk.com/video-18162618_456243561 Sports Interview with a Shakhtar Donetsk player.
https://vk.com/video-155655277_456239927 Sports Ukrainian first league match between FC Hirnyk-Sport and FC Prykarpattia
https://vk.com/video576554975_456272785 Ukraine (Non-War) Interfax press conference regarding the “Mask-Show-Stop” law in pretrial detention.
https://vk.com/video-24262706_161238424 Ukraine (Non-War) Footage of UPA (Nationalist) march in Kiev
https://vk.com/video155142793_456308212 Ukraine (Non-War) Espreso TV coverage of tax evasion enforcement
https://vk.com/video374267542_456248883 Ukraine (Non-War) Coverage around spending by Speaker of the Verkhovna Rada of Ukraine Andriy Parubiy
https://vk.com/video-11019260_456247195 Ukraine War Ukrainian AF Russian Legion, military
https://vk.com/video155142793_456272568 Ukraine War DShK machine guns in Luhansk region competition, military
https://vk.com/video-93448512_456240035 Ukraine War War footage Ukrainian soldiers inspect destroyed Russian positions
https://vk.com/video715174916_456239961 Ukraine War Political talk show touching topics in Russia and Ukraine
https://vk.com/video155142793_456332156 Ukraine War Commentary about Ukraine and Russia
https://vk.com/video-5063972_456241071 Ukraine War Interview with soldiers in Ukrainian village of Yasinuvata
https://vk.com/video549895_456239793 Ukraine War Commentary about the Ukrainian war.
https://vk.com/video62649817_456252177 Ukraine War News coverage about Ukrainian war, Bucha massacre and Kremlin actions
https://vk.com/video-5063972_118384509 Ukraine War Promotional video about Ukrainian marine unit
https://vk.com/video535771132_456240850 Ukraine War Ukrainian security service intercept of battlefield communications.
https://vk.com/video11405356_456239226 Ukraine War A video with a fake “horoscope” that recommends to donate to Ukrainian army
https://vk.com/video-72589198_456240902 Ukraine War Interview with Ukrainian service member.
https://vk.com/video-23502694_456244444 Ukraine War Video of Ukrainian Armed Force tanks
https://vk.com/video-54899733_456240014 USA Biden and Obama at Medal of Honor ceremony.
Таблица 15: Категории заблокированных в России видео из нашей случайной выборки

Видео заблокированные в Канаде

Мы также произвольно отобрали 30 видеороликов, заблокированных в Канаде, и классифицировали их содержание. В отличие от категорий, заблокированных в России, которые в основном были связаны с войной в Украине и в Беларуси, заблокированный контент в Канаде в большей степени относится к развлечениям, в частности к спорту (57%), музыке (40%) и телевизионным программам (3%). Эти категории отражают, что основной мотив блокировки в Канаде связан с соблюдением авторских прав. В Канаде полностью отсутствует блокировка политического, новостного и событийного контента, который преобладает в выборке заблокированного видео в России. Эти результаты еще раз свидетельствуют о том, что цели цензуры в Канаде сильно отличаются от целей цензуры в России: в первом случае она направлена на соблюдение авторских прав, а во втором – на новости, текущие события и политику.

Изображение 7: Категории видео, заблокированных в Канаде, из нашей случайной выборки

 

Video Missing in Canada Category Notes
https://vk.com/video-29412860_456240167 Music Radio broadcast
https://vk.com/video2560911_153209689 Music Music video
https://vk.com/video177634113_456239296 Music Music video
https://vk.com/video-41138955_456239155 Music Music video
https://vk.com/video179151037_456245514 Music Music video
https://vk.com/video-175484418_456239085 Music Music video
https://vk.com/video13944339_456240104 Music Music video
https://vk.com/video-116705_456241000 Music Music video
https://vk.com/video5958883_105821112 Music Music video
https://vk.com/video7238152_456244997 Music Music video
https://vk.com/video179151037_456241049 Music Music video
https://vk.com/video-58492936_456239429 Music Music video
https://vk.com/video-151498735_456245443 Sports Soccer
https://vk.com/video-198813611_456240402 Sports Soccer
https://vk.com/video-141682278_456244465 Sports Soccer
https://vk.com/video-198813611_456240223 Sports Soccer
https://vk.com/video-141682278_456249621 Sports Soccer
https://vk.com/video-198813611_456239230 Sports Soccer
https://vk.com/video-141682278_456249560 Sports Soccer
https://vk.com/video-202752058_456239622 Sports Tennis
https://vk.com/video-141682278_456240046 Sports Soccer
https://vk.com/video-141682278_456245917 Sports Soccer
https://vk.com/video-202752058_456239667 Sports Tennis
https://vk.com/video-141682278_456241114 Sports Soccer
https://vk.com/video-141682278_456241024 Sports Soccer
https://vk.com/video-151498735_456247745 Sports Soccer
https://vk.com/video-198813611_456240557 Sports Soccer
https://vk.com/video-198813611_456240867 Sports Soccer
https://vk.com/video-198813611_456239868 Sports Soccer
https://vk.com/video-156580570_456241205 TV Beating Again (순정에 반하다), Season 1, Episode 8
Таблица 16: Категории видео, заблокированных в Канаде, из нашей случайной выборки

Block messages

In this section, we examine the block messages shown to users who attempt to visit pages with blocked content in Russia and Canada. We found that all content blocked in one region but available in another returns a message to users explaining why the content is unavailable.

We found 336 unique messages shown to users attempting to access blocked content in Russia. All of these messages except one cite a Russian court decision as justification for the block. The only message that does not mention a Russian court decision is the more generic "This video is unavailable in your country," which affected five videos. The remaining 335 messages are in Russian and explain, in a similar format, that the video is blocked in the Russian Federation, who requested the block, and the number and date of the corresponding case.

Although we found more than three hundred block messages in total, the ten most frequent messages account for the vast majority (77.15%) of blocked videos. The message justifying the largest number of blocked videos (33,252 videos, or 35%) was requested by the Prosecutor General's Office, cites case number "27-31-2020/Ид2145-22," and is dated February 24, 2022. Although we could not find the text of this court decision, the same case number was cited by the Russian communications regulator Roskomnadzor when blocking 6,037 websites, and, given the timing, we assume it is related to Russia's full-scale invasion of Ukraine.
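
The requester, case number, and date in these messages follow a regular enough format that they can be extracted programmatically. The sketch below is our own minimal illustration rather than the study's code; the regular expression and helper name are assumptions made for this example.

```python
# Minimal sketch (not the study's code): extract the date and case number from
# a Russian VK block message of the form shown in Table 17.
import re

BLOCK_MSG_RE = re.compile(
    r"от\s+(?P<date>\d{2}\.\d{2}\.\d{4})\s+№\s*(?P<case>\S+)"      # "... от DD.MM.YYYY № CASE"
    r"|№\s*(?P<case2>\S+)\s+от\s+(?P<date2>\d{2}\.\d{2}\.\d{4})"   # "... № CASE от DD.MM.YYYY"
)

def parse_block_message(message: str):
    """Return (date, case_number) if the message cites a case, else None."""
    m = BLOCK_MSG_RE.search(message)
    if not m:
        return None
    date = m.group("date") or m.group("date2")
    case = m.group("case") or m.group("case2")
    return date, case

example = ("Этот материал заблокирован на территории РФ согласно требованию "
           "Генеральной прокуратуры Российской Федерации от 24.02.2022 "
           "№ 27-31-2020/Ид2145-22")
print(parse_block_message(example))  # ('24.02.2022', '27-31-2020/Ид2145-22')
```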

Rank Message Translated Message # of Videos Discovered Blocked Share Cumulative Share
1 Этот материал заблокирован на территории РФ согласно требованию Генеральной прокуратуры Российской Федерации от 24.02.2022 № 27-31-2020/Ид2145-22 This material is blocked on the territory of the Russian Federation in accordance with the request of the General Prosecutor’s Office of the Russian Federation № 27-31-2020/Id2145-22 dated 24.02.2022 33,252 35.02% 35.02%
2 Этот материал заблокирован на территории РФ на основании требования Генеральной прокуратуры Российской Федерации от 12.03.2015 № 27-31-2015/Ид831-15 This material is blocked on the territory of the Russian Federation on the basis of the request of the General Prosecutor’s Office of the Russian Federation from 12.03.2015 № 27-31-2015/Id831-15 11,943 12.58% 47.60%
3 Этот материал заблокирован на территории РФ на основании требования Генеральной прокуратуры РФ от 24.02.2022 № 27-31-2020/Ид2145-22 This material is blocked on the territory of the Russian Federation based on the request of the General Prosecutor’s Office of the Russian Federation № 27-31-2020/Id2145-22 dated 24.02.2022 7,776 8.19% 55.79%
4 Этот материал заблокирован на территории РФ на основании требования Генеральной прокуратуры Российской Федерации от 05.04.2022 № 27-31-2022/Ид4465-22 This material is blocked on the territory of the Russian Federation on the basis of the request of the General Prosecutor’s Office of the Russian Federation № 27-31-2022/Id4465-22 dated 05.04.2022 6,373 6.71% 62.51%
5 Этот материал заблокирован на территории РФ на основании требования Генеральной прокуратуры Российской Федерации от 25.04.2022 № 27-31-2022/Ид5587-22 This material is blocked on the territory of the Russian Federation based on the request of the General Prosecutor’s Office of the Russian Federation № 27-31-2022/Id5587-22 dated 25.04.2022 3,013 3.17% 65.68%
6 Этот материал заблокирован на территории РФ согласно требованию Генеральной прокуратуры Российской Федерации от 27.02.2022 № 27-31-2022/Треб228-22 This material is blocked on the territory of the Russian Federation in accordance with the request of the General Prosecutor’s Office of the Russian Federation № 27-31-2022/Treb228-22 dated 27.02.2022 2,928 3.08% 68.76%
7 Этот материал заблокирован на территории РФ на основании требования Генеральной прокуратуры Российской Федерации от 13.08.2022 № 27-31-2022/Иф-10643-22 This material has been blocked on the territory of the Russian Federation based on the request of the General Prosecutor’s Office of the Russian Federation № 27-31-2022/IF-10643-22 dated 13.08.2022 2,726 2.87% 71.63%
8 Этот материал заблокирован на территории РФ на основании требования Генеральной прокуратуры Российской Федерации от 09.08.2022 № 27-31-2022/Ид11013-22 This material has been blocked on the territory of the Russian Federation based on the request of the Prosecutor General’s Office of the Russian Federation № 27-31-2022/Id11013-22 dated 09.08.2022 2,136 2.25% 73.88%
9 Этот материал заблокирован на территории РФ на основании требования Генеральной прокуратуры РФ № 27-31-2022/Ид13719-22 от 30.09.2022 This material is blocked on the territory of the Russian Federation based on the request of the General Prosecutor’s Office of the Russian Federation № 27-31-2022/Id13719-22 of 30.09.2022 1,645 1.73% 75.62%
10 Этот материал заблокирован на территории РФ на основании требования Генеральной прокуратуры Российской Федерации № 27-31-2022/Треб855-22 от 30.07.2022 This material has been blocked on the territory of the Russian Federation based on the request of the General Prosecutor’s Office of the Russian Federation № 27-31-2022/Treb855-22 of 30.07.2022 1,456 1.53% 77.15%
Table 17: The ten block messages that we found justifying the largest numbers of blocked videos in Russia.
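
For reference, the "Share" and "Cumulative Share" columns in Table 17 can be reproduced directly from the per-message counts; the short sketch below uses only the first two rows of the table as illustrative input.

```python
# Minimal sketch: reproduce the Share and Cumulative Share columns of Table 17
# from per-message block counts. Only the first two rows are shown here; the
# full data set is published in the report's data release.
message_counts = {
    "27-31-2020/Ид2145-22 (24.02.2022)": 33252,
    "27-31-2015/Ид831-15 (12.03.2015)": 11943,
}
total_blocked = 94942  # total videos found blocked in Russia

cumulative = 0.0
for message, count in sorted(message_counts.items(), key=lambda kv: -kv[1]):
    share = count / total_blocked * 100
    cumulative += share
    print(f"{message}: {share:.2f}% (cumulative {cumulative:.2f}%)")
```
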
The earliest court date cited in a block message is March 2, 2014, and the most recent is April 28, 2023, shortly before the end of our testing period on May 14, 2023, a span of roughly nine years. Looking at the cumulative distribution of the court case dates cited in the messages, we see an increase in cited case dates after February 24, 2022 (see Figure 8 and Table 18), when Russia's full-scale invasion of Ukraine began. Before this period, dates were cited in justifications at a stable and relatively constant rate. From late October or early November 2022 until the end of our testing period in May 2023, the rate slows. In addition, between December 26, 2022 and January 26, 2023 there was a gap in which no case was cited in a justification, although this may be explained, at least in part, by Eastern Orthodox Christmas celebrations. The reason for the brief slowdown and the gap is unclear. Overall, the timing of these changes suggests that the ongoing conflict sharply increased the rate at which video content is blocked for Russian users.
Figure 8: Among the 336 block messages citing court cases, the cumulative distribution of court case dates over time. Red marks the increase in court decisions after the full-scale invasion of Ukraine on February 24, 2022; yellow marks the slowdown from late October to early November 2022 until the end of the measurement period; green marks the gap in observed court decisions between December 26, 2022 and January 26, 2023.

Time period Court orders per day Comparison to previous period
March 2, 2014 – February 23, 2022 0.0271
February 24, 2022 – October 31, 2022 0.826 Rate increased by factor of 30.5
November 1, 2022 – April 28, 2023 0.200 Rate decreased by factor of 4.14
Table 18: Comparison of the rate of court orders across time periods.
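
As a sanity check on the factors in Table 18, the per-day rates reported there can be compared directly; the short sketch below reproduces the roughly 30.5-fold increase and 4-fold decrease.

```python
# Minimal sketch: reproduce the rate-change factors in Table 18 from the
# per-day court-order rates reported there.
rates = {
    "2014-03-02 to 2022-02-23": 0.0271,
    "2022-02-24 to 2022-10-31": 0.826,
    "2022-11-01 to 2023-04-28": 0.200,
}
before, during, after = rates.values()
print(f"Increase after the full-scale invasion: x{during / before:.1f}")  # ~30.5
print(f"Decrease from November 2022 onward:     x{during / after:.2f}")   # ~4.1
```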

Videos blocked in Canada

In contrast to Russia, none of the messages returned to users for videos blocked in Canada provides any legal justification for the block. The only two block messages we observed justifying video blocks in Canada are the more generic "This video is unavailable in your country" (87.56%) and "Video sound unavailable" (12.44%). Restricting a video's audio when it contains copyrighted music is a common social media moderation practice. These messages in Canada contrast starkly with those in Russia, which are more varied and overwhelmingly cite a court decision.

Message # of Videos Discovered Blocked Share
This video is unavailable in your country 2,288 87.56%
Video sound unavailable. 325 12.44%
Table 19: The two block messages that we observed for videos blocked in Canada.

See the original report for a discussion of the limitations of our study.

Discussion

In this section, we consider how our research contributes to a greater understanding of Russian social media censorship and how it compares with censorship elsewhere, including a comparison of the Russian and Chinese approaches.

Broad keyword-based blocking of LGBTIQ terms

We found that searches for communities and personal accounts in Russia were censored when the search query contained LGBTIQ-related keywords. This keyword filtering applied exclusively to LGBTIQ terms in Russia and did not apply in Canada or Ukraine. Moreover, it is unclear why such filtering applies only to searches for communities and personal accounts but not to searches for videos. To confirm that these terms are not censored as part of an "18+" or safe-search filter but are targeted specifically at LGBTIQ content, we additionally tested the following search queries:

  • pornography

  • порнография

  • porn

  • порно

  • sex

  • секс

  • fuck

  • ебать

  • блять

  • трахаться

  • трахать

  • anal

  • анальный

  • bitch

  • сука

  • pussy

  • пизда

Since none of the above terms triggered keyword censorship in our searches, we conclude that the LGBTIQ-related keyword censorship is not part of a broader safe-search feature but instead targets LGBTIQ-related search queries exclusively.

It is unclear why keyword filtering is used only to censor LGBTIQ queries and not, for example, queries containing criticism of Putin, of the invasion of Ukraine, or of other content blocked elsewhere on VK. Keyword blocking is a very crude tool. On the one hand, it is overly broad and blocks content that was likely never intended to be blocked: for example, we found that VK hosts many anti-LGBTIQ groups, so blocking LGBTIQ-related search queries prevents users from discovering both pro- and anti-LGBTIQ groups. On the other hand, keyword blocking is simultaneously too narrow. For example, we found that "ЛГБТ" [LGBT] and "ЛГБТК" [LGBTQ] were blocked, but not other variants such as "ЛГБТКИА" [LGBTQIA]. As another example, although "гей" [gay] was censored, "геи" [gays] was not. Some terms were blocked in both Latin and Cyrillic script (e.g., "gay" and "Гей"), while others were blocked only in Cyrillic but not in Latin script (e.g., "Фембой" but not "femboy"). Such inconsistencies give the impression that the list of blocked terms VK uses was assembled arbitrarily. Finally, because keyword filtering applies only to search, users can still reach communities and personal accounts whose names contain blocked keywords by searching for other keywords in their names or by typing their page URLs directly.

Given that keyword blocking is simultaneously too broad, too narrow, and ineffective, it is unclear why it is applied to LGBTIQ content, much less to any content at all. One possibility is that, because Russia's LGBT "propaganda" laws (including the federal law "On protecting children from information promoting the denial of traditional family values") only vaguely define what constitutes "LGBT propaganda," this type of filtering is meant to be highly visible to users even though it is largely ineffective at actually censoring content. In this sense, the filtering may act as a kind of compliance flag intended to demonstrate adherence to Russian law.

Consistent legal justification

We found that VK attributes every blocked community or personal account in Russia to a court decision, and every blocked video in Russia to a blocked community or personal account. In total, we found 336 distinct VK block messages citing 303 unique court case numbers. In some cases, we were able to locate the text of the court decision ordering a community or personal account blocked and to recover the law cited in justifying the blocking decision. Further research is needed to systematically analyze the court cases and laws underpinning VK's blocking decisions and to determine whether VK cites the relevant court decisions in justifying its blocks and whether those decisions cite the relevant laws in justifying the blocking orders. In many cases, the information needed for such an analysis appears to be available. For now, we limit ourselves to one block message that is notable for also citing a press release:

“Этот материал заблокирован на территории РФ на основании решения суда/уполномоченного федерального органа исполнительной власти (Металлургический районный суд г. Челябинска – Челябинская область) от 11.12.2019 № 2а-3052/2019 Комментарий ВКонтакте: vk.com/press/blocking-public38905640” [This material is blocked on the territory of the Russian Federation on the basis of a decision of the court/authorized federal executive body (Metallurgichesky District Court of Chelyabinsk – Chelyabinsk Region) dated 11.12.2019 No. 2а-3052/2019. VKontakte's comment: vk.com/press/blocking-public38905640]

In the press release, published in March 2021, VK cites the tightening of Russian legislation governing social media and its legal obligation to carry out proactive censorship as justification for blocking the community "Альянс гетеросексуалов и ЛГБТ за равноправие" [Alliance of Heterosexuals and LGBT People for Equality].

Lack of transparency in blocking

Although VK consistently attributes blocking in Russia to court decisions, VK's approach of blocking users, and thereby transitively all of their videos, rather than blocking specific videos remains opaque on many levels. Even though VK consistently provides a legal justification for why a community or personal account is blocked in Russia, when viewing a blocked video it is unclear who posted it, and, even when the poster is known, it is unclear to other VK users which of that poster's videos or other content may have prompted the block. The problem is compounded by the fact that a block sweeps up all of the blocked community's or personal account's past and future videos. VK's approach therefore tends toward overblocking, since a community or personal account may have many interests and post content on a variety of topics, including benign content unrelated to the original justification for the block. In reviewing some of the court orders that VK cites in justifying account blocks, we found that they specify no time limit, so blocks may be applied indefinitely, further exacerbating overblocking. Moreover, it is unclear whether VK notifies users that their content has been blocked in Russia. VK users may therefore be unaware that all of their content is inaccessible to users in Russia, particularly if they use VK from outside Russia.

Inconsistent enforcement of copyright law

We found that copyrighted entertainment content, including TV shows, sports, and music videos, was frequently blocked in Canada, whereas in Russia the most commonly blocked videos concerned news and current events, chiefly the war in Ukraine and the protests in Belarus. Copyrighted content was thus largely accessible in Russia even when it was blocked in Canada. Although in this report we did not systematically compare Ukraine and Canada for differential blocking, we generally observed that the same content unavailable in Canada was available in Ukraine. This suggests that VK takes a geographic approach to copyright moderation. According to our analysis, VK's copyright moderation in Russia and Ukraine is far more lax and permissive than in Canada; that is, more content is restricted on copyright grounds for VK users in Canada than for users in Russia and Ukraine. Despite this uneven enforcement, we also found that pirated content is widespread on the platform, especially e-books and music, which are widely available on VK.

This differential treatment of users by region appears in other respects as well, for example in VK's privacy policy, which sets out different data retention rules for users in Russia and users outside of Russia. Under these rules, for instance, VK "stores Russian users' messages for six months and other data for a year (in accordance with paragraph 3 of Article 10.1 of the Federal Law 'On Information, Information Technologies and Information Protection')."

Comparison with Chinese social media censorship

China's system of social media information control is decentralized and characterized by "intermediary liability," or, as it is framed in China, "self-discipline," which allows the Chinese government to offload responsibility for information control onto the private sector. Internet operators that fail to adequately control information can be fined, stripped of their operating licenses, or face other adverse consequences. These companies must decide for themselves what to proactively censor on their platforms, attempting to balance the expectations of their users against appeasing the Chinese government. Block messages are often not displayed on Chinese platforms, so users have no way to learn the legal justification for content being blocked. In Russia, by contrast, VK ultimately attributes the blocking of every video, community, or personal account to the court case that ordered that content blocked. In some cases, we were able to find the text of the court case and recover the laws cited in justifying the blocking request. Although Russia's system of court-ordered blocking leaves much to be desired, it is still more transparent than China's, where blocking decisions are largely made by the private sector and left to the discretion of Internet operators.

Chinese social media companies have struggled to grow their platforms globally while maintaining information controls as they expand. Tencent's WeChat has come under scrutiny for overtly or covertly applying Chinese political censorship and surveillance to the conversations of users registered entirely outside of China. Moreover, WeChat users cannot tell whether they are communicating with a user registered in China and therefore cannot predict the extent to which their messages will be subject to political censorship or surveillance. Unlike Tencent, Bytedance abandoned the idea of a single platform with radically different information control rules for users inside and outside of China, instead operating Douyin inside China and TikTok outside of it, with entirely different user bases. VK's approach of blocking communities and user accounts, rather than content directly, may offer certain advantages in reducing the friction VK would face if it tried to expand globally or beyond Russia's information control regime. On VK, users in Russia simply cannot communicate with or read the content of users blocked in Russia, and so there have been no negative media stories about the removal of non-Russian users' content of the kind reported about WeChat. This difference may be explained by the fact that, on VK, politically motivated blocking appears to affect only what users in Russia can see rather than removing the content for users elsewhere.

Although both China and Russia use Internet censorship to protect the political image of their own leaders, they are inconsistent in how they protect the image of other countries' leaders. While Chinese Internet platforms appear willing to protect Putin's image, we found no evidence that VK blocks content critical of Xi Jinping or any other Chinese leader. In our ongoing research into censorship on Chinese search platforms, we have found that the Chinese search engines Baidu and Sogou, as well as the video-sharing site Bilibili, apply censorship rules relating to "普京" [Putin]. For example, searches on Sogou containing "普京 + 独裁" [Putin + dictatorship], "普京 + 希特勒" [Putin + Hitler], or "普京窃国" [Putin's kleptocracy] restrict results to Chinese state media sites and other pro-Beijing sources. While some censorship rules appear aimed solely at protecting Putin's image, others may point to motives on China's part that are not entirely altruistic. For example, "普京亲信兵变 + 震动中南海" [mutiny of Putin's confidants + shaking the Chinese Communist Party headquarters] and "台湾 + 成为下一个乌克兰" [Taiwan + becoming the next Ukraine] suggest China's unease over how Prigozhin's mutiny might foreshadow the future stability of the Chinese Communist Party's regime, and how Russia's unexpected difficulties in invading Ukraine might foreshadow the prospects of China's ambitions to take Taiwan. More broadly, Chinese censors may be motivated by protecting…

Finally, although there are theories that the Internet is "balkanizing," or splintering into a "splinternet" in which different countries or regions gradually form their own isolated networks, the examples of social media censorship in China and Russia show that the borders of these isolated networks can be quite porous, but only in one direction. On WeChat, users registered in China are subject to strict political censorship, while users elsewhere can not only access WeChat but also express political ideas to one another with relative freedom compared to their Chinese counterparts. We observe the same with VK: users in Russia are subject to pervasive political censorship, while users in other countries not only have access to the site but are also relatively freer in their political expression. Ironically, each of these social networks subjects users from the country in which it was founded to the greatest restrictions, while not only allowing users from other countries to join but also affording them greater freedom of political expression.

Data

The full list of videos, communities, and personal accounts that we found blocked in Russia and Canada, together with the corresponding block messages, is available on GitHub at the following link:

https://github.com/citizenlab/not-ok-on-vk-data

Acknowledgments

We would like to thank Michelle Akim, Siena Anstis, Pellaeon Lin, Irene Poetranto, Adam Senft, and Andrei Soldatov for their valuable guidance and peer review. This research was supervised by Citizen Lab Director Professor Ron Deibert.

]]>
Not OK on VK: An Analysis of In-Platform Censorship on Russia’s VKontakte https://citizenlab.ca/2023/07/an-analysis-of-in-platform-censorship-on-russias-vkontakte/ Wed, 26 Jul 2023 13:59:23 +0000 https://citizenlab.ca/?p=79694 Key findings
  • This report examines the accessibility of certain types of content on VK (an abbreviation for “VKontakte”), a Russian social networking service, in Canada, Ukraine, and Russia.
  • Among these countries, we found that Russia had the most limited access to VK social media content, due to the blocking of 94,942 videos, 1,569 community accounts, and 787 personal accounts in the country.
  • VK predominantly blocked access to music videos and other entertainment content in Canada, whereas, in Russia, we found VK blocked content posted by independent news organizations, as well as content related to Ukrainian and Belarusian issues, protests, and lesbian, gay, bisexual, transgender, intersex, and queer (LGBTIQ) content. In Ukraine, we discovered no content that VK blocked, though the site itself is blocked to varying extents by most Internet providers in Ukraine.
  • In Russia, certain types of video content were inaccessible on VK due to the blocking of the accounts of the people or communities who posted them. These individuals and groups were often targeted for their criticism of Russia’s President Vladimir Putin or of the Russian invasion of Ukraine. Additionally, accounts belonging to these communities and people have been restricted from VK search results in Russia using broad, keyword-based blocking of LGBTIQ terms.
  • We collected over 300 legal justifications which VK cited in justification of the blocking of videos in Russia. Notably, we discovered a 30-fold increase in the rate of takedown orders issued against VK in an eight month period following Russia’s February 2022 invasion of Ukraine.

Introduction

While China is known for fostering its own ecosystem of social media platforms such as the chat app WeChat and microblogging platform Weibo and blocking their American counterparts (e.g., WhatsApp and Twitter), Russia has allowed access to WhatsApp and Twitter, but has also put considerable effort into deploying and promoting Russian equivalents, such as VK and Odnoklassniki, which are roughly equivalent to Facebook; Rutube, a Russian equivalent of YouTube; and Yandex, which is equivalent to Google Search. In 2022, Runiversalis, a pro-Kremlin version of Wikipedia, was launched, reminiscent of Chinese efforts such as Baidu Baike to create a domestic clone of Wikipedia. Although many North American social media platforms remain accessible in Russia, Russia eventually blocked Facebook and Twitter following the 2022 full-scale invasion of Ukraine.

Chinese social media platforms, which are known to apply pervasive political and religious censorship to their Chinese users, take a variety of approaches to treating their non-Chinese users who may have different expectations concerning freedom of speech. While many platforms such as Weibo apply their political censorship even to users outside of China, others such as WeChat, in a bid to try to appeal to non-Chinese users, apply fewer speech restrictions to them. Other companies, such as Bytedance, take the approach of maintaining distinct platforms inside China (Douyin) versus elsewhere (TikTok). Like Chinese platforms, Russian platforms are also known to perform political censorship. However, the mechanisms the latter use to apply censorship, what topics they censor, and if or how those mechanisms apply to users outside of Russia are issues that are still understudied in the research on information controls.

Internet censorship in Russia is enforced through a variety of legal mechanisms. The Federal Service for Supervision of Communications, Information Technology and Mass Media (Roskomnadzor), as the Internet regulator, maintains a centralized “blacklist” governing the blocking of IP addresses, domain names, and unencrypted HTTP URLs, which Internet service providers (ISPs) in Russia are legally obliged to implement. However, the censorship of social media content, which, due to HTTPS encryption, cannot be individually blocked by ISPs, is maintained through other legal mechanisms such as court orders. Multiple government (e.g., the Roskomnadzor and the office of the Prosecutor General) and non-government agencies (e.g., Rosmolodezh, the Federal Agency for Youth Affairs) can apply for a court order to have websites blocked, in which they typically appeal to one of Russia’s multiple laws governing Internet content. These laws often contain vague terms concerning the content they prohibit, including “нарушением установленного порядка” [violation of the established order], “нецензурную брань” [obscene language], “явное неуважение к… органам, осуществляющим государственную власть в Российской Федерации” [blatant disrespect for… bodies exercising state power in the Russian Federation], and “Пропаганда нетрадиционных сексуальных отношений и (или) предпочтений” [propaganda of nontraditional sexual relations and (or) preferences]. These laws have been used to justify political censorship of Internet content, particularly content critical of Putin or other Russian leadership, and to justify the restriction of the rights of LGBTIQ communities. Furthermore, according to a law which went into effect in February 2021, social media platforms are required to implement blocking proactively, as opposed to merely in response to court orders. Previous research studying Chinese social media censorship has shown how deferring blocking decisions to the private sector gives rise to inconsistent blocking across companies, with platforms often “overblocking” to ensure that they have covered all of the bases to avoid legal repercussions for insufficiently blocking content.

In this report, we study VKontakte [ВКонтакте], commonly abbreviated as “VK,” which is the most popular social media platform in Russia. VK is similar to Facebook in that it provides personal accounts, messaging, music and video hosting, and other community features. The platform is divided into three broad organizational categories: videos, communities or clubs, and people. VK has a complicated history concerning Russian censorship. The platform was founded in 2006 by Pavel Durov, who is also known for founding Telegram Messenger. Durov was dismissed as VK’s chief executive officer (CEO) in 2014, allegedly for failing to hand over the data of Russian political protesters to the Russian Federal Security Service (FSB), the country’s security agency. Durov was also targeted for failing to ban a VK community advocating for Alexei Navalny, a political opponent of Putin. Durov additionally claimed that the platform had come under “full control” of “Kremlin insiders” after the platform was sold to Alisher Usmanov, an oligarch loyal to Putin. In 2021, VK’s then-CEO Boris Dobrodeev resigned following the takeover of the company by state-owned companies. Analysts speculated that this state takeover could lead to “greater interference” by the Russian government.

In addition to criticism due to censorship, VK has been criticized by the digital security community as a platform that is unsafe for activists. This allegation was made on account of the personal information which it collects and due to VK joining the Register of Organizers of Distribution of Information in the Internet Network, a special list of platforms that must provide user data on request to the FSB and Russian police. Several waves of “exodus” of users from VK have been documented — the earliest one corresponding to the year of Durov’s departure — due to fears of government surveillance and legal harassment.

In this work, we are interested in measuring how VK implements political censorship in the context of the full-scale invasion of Ukraine that began in February 2022. Our research includes identifying what mechanisms VK uses to enforce censorship, what type of content is censored, and if or how this censorship applies to users outside of Russia. Specifically, we measure the accessibility of content on VK from different countries or vantage points to uncover instances of differential censorship, i.e., content which is censored in one region but not another. This allows us, for example, to determine which content is visible in Canada but not in Russia, and vice versa. In this report, we focus on comparing content availability from Russia, Ukraine, and Canada.

The remainder of this report is structured as follows: In the “Methodology” section, we detail our methods for uncovering VK’s differing censorship across countries, and, in “Experimental setup,” we explain the conditions and implementation details in which we executed these methods. Furthermore, in “Results,” we reveal our findings concerning the pervasive political and social censorship which VK applies to users in Russia. In “Limitations,” we review the limitations of our experiment, and finally, in “Discussion,” we discuss how our findings contribute to a greater understanding of Internet censorship in Russia and how Russian social media censorship compares to censorship elsewhere.

Methodology

This section details our methodology for measuring differential censorship on VK across Canada, Ukraine, and Russia. We conducted our research entirely without registration of or interaction with any user accounts on the VK platform. Instead, we tested access from network vantage points in the regions that we chose to compare, comparing the differences in what content was accessible in each region on VK’s website. This method ensures that we can conduct our testing without obtaining SIM cards or phone numbers, without worrying about account termination, and without the ethical concerns of creating or transmitting content over the platform.

To test for differential censorship on VK, and to ensure that we have a diverse sample of popular, easily enumerable topics to query, we sampled from Wikipedia article titles. We began by selecting the following six language editions of Wikipedia: Russian, Ukrainian, Belarusian, Georgian, Chechen, and Kazakh. We selected these language editions because VK is a Russian platform, and these are languages commonly spoken both within and in areas surrounding Russia. For each of these Wikipedia language editions, we independently sorted their articles by the total number of views that they had during January, February, and March 2023, sorting them in descending order of popularity. In our testing, we drew from across these six sorted lists in a round-robin fashion such that we tested the same number of articles from each language edition of Wikipedia.
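
As an illustration, the round-robin interleaving of the six popularity-sorted title lists can be sketched as follows; the titles shown are placeholders rather than actual sampled articles, and this is our own sketch rather than the study's code.

```python
# Minimal sketch: round-robin interleaving of popularity-sorted Wikipedia
# article titles across the six language editions, so that the same number of
# titles is drawn from each edition. Each list is assumed to be pre-sorted by
# January-March 2023 page views, most viewed first; titles here are placeholders.
from itertools import zip_longest

def round_robin(*lists):
    """Yield one item from each list in turn until every list is exhausted."""
    for group in zip_longest(*lists):
        yield from (item for item in group if item is not None)

titles_by_language = {
    "ru": ["статья А", "статья Б"],
    "uk": ["стаття А", "стаття Б"],
    "be": ["артыкул А", "артыкул Б"],
    "ka": ["სტატია ა", "სტატია ბ"],
    "ce": ["яззам А", "яззам Б"],
    "kk": ["мақала А", "мақала Б"],
}
test_queries = list(round_robin(*titles_by_language.values()))
```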

In this work, we are interested in comparing the availability of VK content across Canada, a liberal democracy where we are based, as well as Russia and Ukraine, the countries with the first and third largest number of visitors to VK, respectively. VK allows for searching for videos, communities (also called groups or clubs), and people. As such, for each article title that we tested, we performed nine search queries simultaneously on the VK website. For each of the countries of Canada, Ukraine, and Russia and for each of the videos, communities, and people search targets, we queried the article title on that search target in that country. For each of these nine combinations, we recorded the number of search results for the query.

As we noticed that VK would on occasion spuriously report a smaller number of results than what it ordinarily would for a query, we implemented the following retest procedure. Twenty-four hours following an original query, we repeated searching the query from the same vantage point on the same search target. We then recorded whichever is greater of the number of results reported in the retest and the number of results reported in the original test.
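
A minimal sketch of this bookkeeping, covering both the nine country-and-target combinations described in the previous paragraph and the retest rule, is below; count_results() is a hypothetical stand-in for the measurement code that issues the actual searches through each vantage point.

```python
# Minimal sketch: for one query, record result counts for all nine
# country x search-target combinations, then keep the maximum of the original
# test and a retest run 24 hours later to smooth out spurious low counts.
# count_results() is a hypothetical helper, not the study's actual code.
from itertools import product

COUNTRIES = ["canada", "ukraine", "russia"]
TARGETS = ["videos", "communities", "people"]

def measure(query: str, count_results) -> dict:
    """Record the reported result count for each (country, target) pair."""
    return {
        (country, target): count_results(query, country, target)
        for country, target in product(COUNTRIES, TARGETS)
    }

def merge_retest(original: dict, retest: dict) -> dict:
    """For each of the nine combinations, keep the larger of the two counts."""
    return {key: max(original[key], retest[key]) for key in original}
```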

Next, we sought to establish some threshold under which we could label the search result numbers from two different regions as suspiciously different. We noticed that comparing absolute numbers favored finding queries which had large numbers of results as suspiciously different, whereas comparing by the percent difference favored finding queries which had small numbers of results as suspiciously different. As such, to test whether the number of search results x from one region is suspiciously less than the number of results y from another region, we employ the following statistical heuristic. We noticed early on by analyzing the initial results between Canada and Ukraine that they were consistent modulo random fluctuation. In an early sample, we saw that in 132 of the tests there were eight which had a different number of results, each with a difference of one. As we did not know the variables affecting these random fluctuations (i.e., is or to what extent is the size of the fluctuation proportional to the number of results?), we chose 8/132 as a clear upper bound for the proportion of results we would find missing by chance. Using a one-sided chi-squared test, we then performed a test of difference in proportions, namely, the hypothesis that (y − x) / y ≤ 8/132. If we reject this hypothesis, with p < 0.001 probability that such a difference in proportions could have arisen by chance, we conclude that x is suspiciously missing results compared to y.
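
A minimal sketch of this heuristic is below. It is our own illustration rather than the study's code: we assume the one-sided chi-squared test is realized as a 2x2 contingency test comparing the query's missing and present counts against the 8-of-132 baseline, halving the two-sided p-value and requiring the observed missing proportion to exceed the baseline.

```python
# Minimal sketch (our illustration, not the study's code) of the
# suspicious-difference heuristic. SciPy's default continuity correction for
# 2x2 tables is kept.
from scipy.stats import chi2_contingency

BASELINE_MISSING, BASELINE_TOTAL = 8, 132  # upper bound on chance fluctuation

def suspiciously_missing(x: int, y: int, alpha: float = 0.001) -> bool:
    """True if the region reporting x results is suspiciously missing results
    compared to the region reporting y results."""
    if x >= y:
        return False
    table = [
        [y - x, x],  # this query: results missing vs. results present
        [BASELINE_MISSING, BASELINE_TOTAL - BASELINE_MISSING],  # baseline
    ]
    chi2, p_two_sided = chi2_contingency(table)[:2]
    observed = (y - x) / y
    baseline = BASELINE_MISSING / BASELINE_TOTAL
    # One-sided: reject only when the missing proportion exceeds the baseline.
    return observed > baseline and (p_two_sided / 2) < alpha
```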

For search result numbers that seemed to be suspiciously missing results, we further explored which content was missing from their results. Since VK only allows revealing up to 999 search results for a query, we limited our investigation to queries with fewer than 1,000 results. For any number of results x and number of results y, if x < 1,000, y < 1,000, and either x or y is suspiciously missing results compared to the other, we downloaded all of the search results for both x and y and recorded which were missing from each.
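
Recording what is missing on each side then reduces to a set difference over the downloaded result identifiers; a minimal sketch with placeholder identifiers is below.

```python
# Minimal sketch: given the full result lists from two regions (each under
# 1,000 results), compute which items are missing from each side. The
# identifiers shown are placeholders, not real VK result IDs.
results_region_a = {"video-1_1", "video-2_2", "video-3_3"}
results_region_b = {"video-1_1"}

missing_in_b = results_region_a - results_region_b  # candidates blocked in region B
missing_in_a = results_region_b - results_region_a  # candidates blocked in region A
```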

For each result missing from our case studies, to better understand why that video, community, or person was missing from the results for that country, we attempted to access that result from the region in which it was missing. For example, we attempted to access the result using both the desktop (vk.com) and mobile (m.vk.com) versions of the VK website. We recorded any error message or other block message which was displayed to the user. Specifically for missing video results, we attempted to access additional pages from that country. A video on VK can be associated with an individual poster, a community poster, or both. To better understand why videos are missing, we also attempted to access their individual and community posters and recorded any error or block message displayed on their pages. We attempted access to these pages using both the desktop (vk.com) and mobile (m.vk.com) versions of the site.
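
A minimal sketch of this follow-up check is below; fetch_page() is a hypothetical helper, the URL rewriting is a simplification, and the block-message strings are those reported later in Table 3.

```python
# Minimal sketch (not the study's code): given a result URL missing in some
# region, fetch the desktop and mobile pages from that region and record any
# known block message found in the response. fetch_page() is hypothetical.
BLOCK_MESSAGES = [
    "This video is unavailable in your country",
    "Video sound unavailable.",
    "This video is unavailable because its creator has been blocked",
    "Этот материал заблокирован на территории РФ",
]

def classify_block(url: str, fetch_page) -> dict:
    """For the desktop and mobile versions of the page, return whichever known
    block message (if any) appears in the fetched HTML."""
    variants = {
        "desktop": url,
        "mobile": url.replace("https://vk.com/", "https://m.vk.com/"),
    }
    return {
        name: next((msg for msg in BLOCK_MESSAGES if msg in fetch_page(variant_url)), None)
        for name, variant_url in variants.items()
    }
```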

Experimental setup

We implemented the above methodology in Python using the aiohttp and SciPy modules and executed the code on an Ubuntu 22.04 Linux machine. We performed this experiment from April 17 through May 13, 2023. Our Canadian measurements were performed from a University of Toronto network. Our Russian and Ukrainian measurements were performed through WireGuard tunnels, as provided by a popular VPN service offering Russian and Ukrainian vantage points. In light of VK being blocked to varying extents on most Ukrainian networks due to a ban of the site, we confirmed that our Ukrainian vantage point had access to VK before performing our experiments (see Appendix A for our analysis of Ukraine’s ban of VK).
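
For illustration, the concurrent fetching pattern with aiohttp can be sketched as follows; this is our own simplified example rather than the study's measurement code, and the URL construction and response parsing are intentionally omitted.

```python
# Minimal sketch (not the study's code): fetch a batch of pages concurrently
# with aiohttp from whichever vantage point this machine's network (or
# WireGuard tunnel) provides. URL construction and parsing are omitted.
import asyncio
import aiohttp

async def fetch(session: aiohttp.ClientSession, url: str) -> str:
    async with session.get(url) as response:
        return await response.text()

async def fetch_all(urls: list[str]) -> list[str]:
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, url) for url in urls))

# Example usage: pages = asyncio.run(fetch_all(["https://m.vk.com/..."]))
```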

Results

During our testing period, we tested on VK the accessibility of the titles of the top 127,187 articles in each of the Russian, Ukrainian, Belarusian, Georgian, Chechen, and Kazakh language Wikipedias. Together, this sample set comprised 708,346 unique article titles. By measuring what content was blocked in VK search queries for these titles, we found differential blocking of videos in Canada, as well as of videos, communities, and personal accounts in Russia, although the motives for the blocking of videos in Canada versus Russia appeared starkly different, as we explain below. We also found that in Russia, search query results for communities and people were censored by keyword, whereas in Canada we found no such filtering.

Notably, we found no in-platform differential blocking carried out by VK in Ukraine compared to Canada or in Ukraine compared to Russia. Since we did not discover any content that was accessible in Canada or Russia but that was inaccessible in Ukraine, in the remainder of this section, we will focus on comparing differential blocking in Russia compared to Canada and vice versa. We first provide an overview of our findings and detail the different blocking mechanisms that VK uses. We then use data analysis techniques to better understand the blocked content that we discovered, such as what type of content was blocked, what events precipitated its blocking, or what legal justifications did VK cite in its blocking.

Blocking overview and mechanisms

From our results, we were able to infer that VK used multiple methods of blocking. The primary method seen was the blocking or removal of certain search results. For content missing in Canada, for example, we saw no missing personal account results. Nine communities were missing from Canadian search results, but the results themselves were still accessible in Canada by typing the URLs for the communities’ pages, and there did not appear to be anything about them that would suggest why they might be removed from search results. As such, we believe that they were false positives in that they were missing from the search results for completely benign reasons such as different load balancers or caching servers possessing inconsistent views of the same data. Aside from a small number of what also seemed to be false positives, all of the 2,613 videos tested in Canada that were missing from the search results showed either a “This video is unavailable in your country” or a “Video sound unavailable” block message. These videos appeared to all be popular sports, music, and other entertainment videos posted by ordinary users and, when considering the explanation given in their block messages, these videos were likely blocked in Canada for copyright infringement. These videos, however, were still available in Russia. Our hypothesis is that this differing treatment of copyright-infringing content could be explained by the overall inconsistent way that VK enforces copyright law across multiple regions.

Ukraine Canada Russia
Videos None observed Copyright infringement targeting videos When a community or person posting a video is blocked, that video is also blocked
Communities None observed None observed (1) LGBTIQ keyword-based blocking of search queries for communities and (2) political blocking of communities
People None observed None observed (1) LGBTIQ keyword-based blocking of search queries for people and (2) political blocking of people

Table 1: For each region, for each content type, the methods of blocking which we discovered.

We observed more diverse methods and motivations behind content being unavailable in Russia (see Table 1 for a summary). When searching for communities and people, we observed that VK disabled search results if the search query contained certain LGBTIQ-related keywords (see Figure 1 for an illustration and Table 2 for a list of the keywords which we discovered triggering filtering). While it applied for searches for communities and people, this keyword-based censorship of search queries did not appear to apply to searches for videos.

Figure 1: Searching for “lgbt” in Russia blocked all results for containing the keyword “lgbt”.
Keyword English translation
gay gay
LGBT LGBT
Геи gay
Гей gay
ЛГБТ LGBT
ЛГБТК LGBTQ
Лесбиянка lesbian
Трансгендер transgender
Фембой femboy

Table 2: Keywords censoring search queries for communities and people in Russia.

Aside from VK’s keyword-based filtering of searches for community and personal accounts, VK also directly blocked individual community and personal accounts, which also hides them from search results and displays a block message when viewing the account’s page. In fact, blocking community and personal accounts appears to be VK’s primary method of censoring videos in Russia. Outside of 134 videos which displayed no block message and which we believe to be false positives, the remaining 94,942 videos missing from search results showed a block message on the desktop version of VK such as “This video is unavailable because its creator has been blocked.” We confirmed that all of these videos were blocked due to the community or person who had posted the video being blocked, because when we attempted to view the community or the account that posted these videos, we received a block message that mentioned a court order for the blocking.

Figure 2: Example of blocked video on desktop version of VK.

To justify the blocking of communities and personal accounts in Russia, we observed 336 unique VK block messages citing 303 different legal case numbers. An example of such a block message is as follows: “Этот материал заблокирован на территории РФ на основании решения суда/уполномоченного федерального органа исполнительной власти (Центральный районный суд г. Хабаровска – Хабаровский край) от 10.08.2015 № 2-5951/2015” [This material was blocked in the territory of the Russian Federation on the basis of the decision of the court / authorized federal executive body (Central District Court of Khabarovsk – Khabarovsk Territory) dated 10.08.2015 No. 2-5951/2015]. In instances where information is publicly available, these legal cases appear to be takedown requests filed by Russian prosecutors or other actors, which appeal to varying Russian laws for justification. For example, in the case cited in the aforementioned block message, the Russian prosecutor appeals to Article 4 of a Russian law “On Mass Media” to ask the court to order the takedown of content on VK which allegedly uses obscene language to refer to Vladimir Putin.

Figure 3: An example of a blocked community page citing a legal justification on the desktop version of VK; the text reads “Этот материал заблокирован на территории РФ на основании требования Генеральной прокуратуры Российской Федерации от 13.08.2022 № 27-31-2022/Иф-10643-22” [This material has been blocked on the territory of the Russian Federation based on the request of the General Prosecutor’s Office of the Russian Federation № 27-31-2022/IF-10643-22 dated 13.08.2022].
While the desktop version of VK consistently showed the message “This video is unavailable because its creator has been blocked” when a community or person posting the video was blocked, we found that, on the mobile version, if the video was posted by a community blocked in Russia (as opposed to a personal account), then it shows the block message of the blocked community instead. It is unclear why this inconsistency exists.

We also rarely observed other error messages which were not block messages such as “Please sign in to view this video,” “Access to this video has been restricted by its creator,” and “This page has either been deleted or not been created yet.” These messages were not indicative of blocking but rather would occur if content were deleted or restricted at the poster’s discretion during our testing process. Therefore, we did not consider such content with these error messages to be blocked. See Table 3 for a breakdown of the types of error messages which we observed.

Error message type Example(s) Observed in? Block message?
Court ordered blocking Этот материал заблокирован на территории РФ на основании решения суда/уполномоченного федерального органа исполнительной власти (Центральный районный суд г. Хабаровска – Хабаровский край) от 10.08.2015 № 2-5951/2015” [This material was blocked in the territory of the Russian Federation on the basis of the decision of the court / authorised federal executive body (Central District Court of Khabarovsk – Khabarovsk Territory) dated 10.08.2015 No. 2-5951/2015] Russia Yes
The community or person posting this video has been blocked “This video is unavailable because its creator has been blocked” Russia Yes
Copyright infringement “This video is unavailable in your country”; “Video sound unavailable.” Canada Yes
Permission denied “Please sign in to view this video”; “Access to this video has been restricted by its creator” Canada, Russia No
Deleted “This page has either been deleted or not been created yet.” Canada, Russia No

Table 3: Breakdown of types of error messages, examples of them, where they have been observed, and whether they are block messages identifying differential blocking.

While earlier we found blocked communities and personal accounts due to their results missing in search results, we can also find them from looking to see who posted blocked videos. Working backward from blocked videos to find the blocked communities or personal accounts who posted them, we found an additional 826 communities and 768 personal accounts blocked in Russia. Together with 804 blocked communities and 19 blocked personal accounts directly missing from community and people search results, we found 1,569 unique communities and 787 unique personal accounts blocked in Russia (see Table 4 for a summary of all content blocked).

Canada Russia
Method of discovery Missing in results (unique) Blocking a video (unique) Total (unique) Missing in results (unique) Blocking a video (unique) Total (unique)
Videos 2,613 N/A 2,613 94,942 N/A 94,942
Communities 0 0 0 804 826 1,569
People 0 0 0 19 768 787

Table 4: For each region, for each content type, the number of blocked instances of that content type in that region. For communities and personal accounts, we further break down the number we discovered from their absence in search results versus for having posted a blocked video.

While we have given a brief overview of the types of blocking methods on VK, as well as the amount of content subject to each type of blocking across different regions, in the remainder of this section we will perform a deeper analysis of the type of content blocked on VK. We will first characterize blocked videos in Canada and Russia according to the search queries from whose results they were missing, the posters of the blocked videos, and a random sampling of the blocked video contents themselves. Then, we analyze the block messages and legal justifications which VK communicates to the user upon attempting to view blocked content.

Analysis of blocked videos

In this section, we characterize blocked videos according to the search queries from whose results they were missing, the posters of the blocked videos, and a random sampling of the contents of the blocked videos themselves.

What search queries discovered blocked videos?

Recall that, during our testing, we took popular Wikipedia article titles from multiple language editions of Wikipedia and used them to search for videos, communities, and personal accounts on the VK website, to see if and to what extent these search results were blocked in one region versus another. In this section, we are specifically interested in the search queries that led to the discovery of large numbers of blocked videos, as such queries can signal the type of content blocked on VK. We call such queries productive queries.

Videos blocked in Russia

Among the top ten most productive queries (i.e., those leading to the discovery of the greatest number of blocked videos), we see that most are related to the Ukraine war (“Учасники російсько української війни Ш” [Participants of the Russian-Ukrainian war], “Пропаганда війни в Росії” [Propaganda of War in Russia]) and international bodies that are involved in mediating the conflict (“Генеральна Асамблея ООН” [UN General Assembly], “Міжнародний суд ООН” [International Court of Justice]).

The most productive queries in Russia can be understood to be indirectly related to the war, such as “Secret Invasion,” derived from a Wikipedia article for a Marvel comic series turned TV show but nevertheless uncovering blocked content related to the “invasion” of Ukraine more broadly, and “Катэгорыя 24 лютага” [Category: February 24], the title of a Wikipedia article listing holidays on February 24, which is also the day of Russia’s full-scale invasion of Ukraine. There are also productive queries related to Ukraine more generally, such as “Поліський район” [Poliskyi District], a former administrative region in Kyiv Oblast, and the anthem of Ukrainian President Volodymyr Zelenskyy’s home town, “Кривий Ріг” [Kryvyi Rih]. We also found one term related to a news service in Belarus (“БелаПАН” [BelaPAN]), as well as one term that appeared unrelated to the conflict (“Чорні троянди” [Black Roses]) but that, upon closer inspection, turned out to be the name of a blocked Ukrainian pro-military group which had posted a large number of videos.

Rank Query Translation Description of Page Query Language # of Videos Discovered Blocked Total Results
1 Чорні троянди Black Roses Turkish rose plants Ukrainian 493 904
2 БелаПАН BelaPAN Private news agency in Belarus Belarusian 476 843
3 Кривий Ріг моє місто Kryvyi Rih – my city Anthem of Ukrainian town Ukrainian 450 493
4 Пропаганда війни в Росії Propaganda of War in Russia Description of Russian War Propaganda Ukrainian 449 625
5 Генеральна Асамблея ООН UN General Assembly United Nations General Assembly Ukrainian 424 798
6 Міжнародний суд ООН International Court of Justice United Nations International Court of Justice Ukrainian 419 807
7 Учасники російсько української війни Ш Participants of the Russian-Ukrainian war Description of the conflict. Ukrainian 346 453
8 Сакрэтнае ўварванне Secret invasion Marvel Mini Series Belarusian 322 358
9 Поліський район Poliskyi District Former region of Ukraine in Kyiv Oblast Ukrainian 314 356
10 Катэгорыя 24 лютага Category: February 24th An index of holidays on February 24th Belarusian 293 363

Table 5: The ten most productive queries in Russia, i.e., those which we tested which discovered the most blocked videos in Russia.

Videos blocked in Canada

In contrast to the test results from Russia, the most productive queries in Canada did not deal with the Ukraine war but rather with sports, music, and geographic locations. Most of the queries (six of ten) are related to sports, including the Davis Cup (in Russian and Belarusian), the World Figure Skating Championships, and three different soccer players (Ciro Immobile, Alejandro Gomez, and Duván Zapata). There are also queries related to music (K Ci & JoJo and Beatles Bootleg Recordings) and geographic locations (Locust and Charleroi). The queries that led to blocked content in Canada differ from those in Russia and are focused on entertainment rather than current events.

Rank Query Translation Description of Page Query Language # of Videos Discovered Blocked Total Results
1 Локаст Locust City in the United States Chechen 161 284
2 Кубок Дэвиса Davis Cup The Davis Cup tennis trophy Russian 123 457
3 Иммобиле Чиро Ciro Immobile Italian soccer player Russian 78 194
4 Шарлеруа Charleroi City in Belgium Ukrainian 59 285
5 K Ci JoJo K Ci & JoJo Musicians Georgian 57 251
6 Чемпионат мира по фигурному катанию 2023 World Figure Skating Championships 2023 Figure skating event Russian 55 157
7 Кубак Дэвіса Davis Cup The Davis Cup tennis trophy Belarusian 54 456
8 The Beatles Bootleg Recordings 1963 The Beatles Bootleg Recordings 1963 Compilation Beatles album Georgian 53 253
9 Алехандро Гомес Alejandro Gomez Argentine soccer player Ukrainian 52 258
10 Сапата Дуван Duván Zapata Colombian soccer player Russian 52 77

Table 6: The top ten most productive queries in Canada, i.e., those which we tested which discovered the most blocked videos in Canada.

What languages are blocked videos in?

In the previous section, we looked at the search queries that led to the discovery of large numbers of blocked videos. In this section, we perform a similar analysis but based on which language edition of Wikipedia the search query was from. Our purpose is to see which Wikipedia language edition’s article titles led to the largest numbers of blocked videos. We do this to better understand the languages of the video content blocked on VK.

Videos blocked in Russia

Among videos blocked in Russia, we find that queries from the Ukrainian language Wikipedia discovered the largest share (61%) of blocked video results, followed by Belarusian (36%), with Russian a distant third (1%). All remaining languages (Kazakh, Chechen, and Georgian) accounted for less than 0.3% each. Seeing a disproportionately large amount of Ukrainian content blocked is surprising because, after VK’s 2017 blocking in Ukraine, average daily visits from Ukrainian users dropped from 54% of Ukrainian Internet users to only 10% of Ukrainian Internet users visiting VK on a given day. Moreover, despite VK being a Russian social media platform, Russian language queries in Russia led to the discovery of only a small share of blocked videos (1.33%). However, these findings may merely speak to the effectiveness of VK’s censorship regime at disincentivizing Russians, and therefore largely Russian-speaking users, from engaging in censored speech. Furthermore, the social cost of being blocked in Russia is greater for Russians than for those outside of Russia, further disincentivizing sensitive political speech for users in Russia.

Figure 4: For videos found blocked in Russia, the number of videos discovered via queries originating from which language edition of Wikipedia.
Language # of Videos Discovered Blocked Share
Ukrainian 148,313 61.56%
Belarusian 87,521 36.33%
Russian 3,205 1.33%
Kazakh 854 0.35%
Chechen 760 0.32%
Georgian 264 0.11%

Table 7: For videos found blocked in Russia, the number of videos discovered via queries originating from which language edition of Wikipedia.

Videos blocked in Canada

In contrast to Russia, which blocked a large share of videos queried using article titles from the Ukrainian language Wikipedia, the language composition of the queries that led to the discovery of blocked videos in Canada is markedly different. Among the videos blocked in Canada, the Russian language is most represented in our data set with a 43.44% share of results, followed by Kazakh (20.47%) and Georgian (13.89%). All remaining languages (Ukrainian, Chechen, and Belarusian) have less than a ten percent share. Russian is thus far more represented in Canada (43.44%) than in Russia (1.33%). This finding is more in line with expectations: VK is a Russian social media platform with a predominantly Russian user base, and such a platform would therefore contain more Russian-language content requiring moderation than content in any other language.

These findings reflect VK’s differing motivations in blocking videos in Russia versus Canada. In Russia, VK appears motivated to block content expressing certain political views, which are often expressed by Ukrainian and Belarusian speakers. In Canada, however, VK blocks content that infringes copyright, which we would expect to be committed with roughly equal frequency by speakers of different languages. As VK is a Russian platform, we would therefore expect to see higher absolute numbers of Russian speakers moderated due to their greater representation on the platform.

Figure 5: For videos found blocked in Canada, the number of videos discovered via queries originating from which language edition of Wikipedia.
Language # of Videos Discovered Blocked Share
Russian 1,426 43.44%
Kazakh 672 20.47%
Georgian 456 13.89%
Ukrainian 315 9.59%
Chechen 237 7.22%
Belarusian 177 5.39%

Table 8: For videos found blocked in Canada, the number of videos discovered via queries originating from which language edition of Wikipedia.

Who posted blocked videos?

Next, we review who posted the largest share of the blocked content that we discovered on VK to get a sense of the individuals or entities whose content is most affected by VK’s blocking and what they are posting. We divide our examination into two user types per country: videos posted by personal accounts and videos posted by communities. It should be noted that some community pages carry the branding of a company, but it is not always clear whether these are officially operated accounts. VK offers a verification system for companies and brands, but verification is optional, and some companies may be unaware of the process or unwilling to go through it. In our discussion, we mention whether a company is verified.

Videos blocked in Russia posted by personal accounts

From examining the videos that were blocked in Russia, we discovered 1,429 personal accounts that were blocked in the country. Among these, one poster named “Oleg Skripnik” accounts for an outsized portion (37%) of the blocked videos that we discovered, followed by “Daryna Ivaniv” (12%) and “Podryv Ustoev” (4%). These top three posters account for 53% of all videos that we discovered were posted by blocked personal accounts, underscoring how a small number of posters are overrepresented in terms of video blocking. The majority of the blocked personal accounts post political content only occasionally and cannot be described as accounts primarily used for activism. A few of the blocked personal accounts appear to belong to members of the Ukrainian military and are still active. This finding shows that, despite wide criticism of VK as an insecure, pro-Russian platform, and despite its blocking in Ukraine (see Appendix A), it is still used by many Ukrainians, including some currently on the frontlines.

Rank Profile URL Account Name Content Posted # of Videos Discovered Blocked Share
1 https://vk.com/skripoleg Oleg Skripnik Ukraine war content 19,061 37.93%
2 https://vk.com/id576554975 Daryna Ivaniv Ukraine war content 6,328 12.59%
3 https://vk.com/s.krupko63 Podryv Ustoev Ukraine war content 2,131 4.24%
4 https://vk.com/id613313976 Daryna Ivaniv Ukraine war content 1,228 2.44%
5 https://vk.com/id229910131 Masha Vedernikova Ukraine war content 1,193 2.37%
6 https://vk.com/id303073458 Boris Suslenskiy Ukraine war content 1,005 2.00%
7 https://vk.com/id157885457 Lyubov Platonova Ukraine war content 770 1.53%
8 https://vk.com/id293387897 Vasily Zhazhakin Ukraine war content 690 1.37%
9 https://vk.com/id129054771 Igor Zachosa Ukraine war content 604 1.20%
10 https://vk.com/id22401146 Sergey Derkach Ukraine war content 568 1.13%

Table 9: The ten personal accounts which we discovered with the most blocked videos in Russia.

In addition to the 1,429 blocked personal accounts found via blocked videos, when we directly searched different article titles within the “People” category, we found an additional 19 blocked personal accounts, which we identified by their absence from our search query results. All of these additional accounts are related to Praviy Sektor, a Ukrainian nationalist group, except for one account titled “Femboy Developer” (see Table 10).

Profile URL Title
https://vk.com/id315585161 Praviy-Sektor Zakarpattya
https://vk.com/id287586663 Praviy-Sektor Shishaki-Ray-Org
https://vk.com/id241957654 Pravy Sektor
https://vk.com/id253532397 Praviy-Sektor Peremishlyani
https://vk.com/id303491180 Pravy Sektor
https://vk.com/id257667002 Praviy-Sektor Praviy-Sektor
https://vk.com/id459902176 Pravy Sektor
https://vk.com/id247366231 Praviy-Sektor Chechelnik
https://vk.com/ukrop24 Praviy-Sektor Dikanka-Rayorg
https://vk.com/drogobych_ps Drogobich Praviy-Sektor
https://vk.com/id244694134 Pravy Sektor
https://vk.com/id289687245 Praviy-Sektor Kolomia
https://vk.com/id248075744 Pravy Sektor
https://vk.com/id297537442 Pravy Sektor
https://vk.com/id284720470 Praviy-Sektor Karlivka
https://vk.com/pszak Pravy Sektor
https://vk.com/id406055235 Praviy-Sektor Kolomia
https://vk.com/id366480496 Pravy Sektor
https://vk.com/femboy_dev Femboy Developer

Table 10: The profiles of the additional 19 blocked personal accounts which we discovered from their absence in search queries.

Videos blocked in Russia posted by communities

We discovered 826 communities blocked by VK in Russia from the videos they posted, which were also blocked in Russia. Ten of these blocked communities are ranked in Table 11 by the number of blocked videos which we discovered that they posted. These communities include those focused on news (“Ploscha”) and Ukrainian and Belarusian patriotic communities (“My Country Belarus,” “My Ukraine,” and “Patriots of Ukraine”). One account belongs to a regional Ukrainian television station, Channel 1 – Urban First [Канал 1 – Первый Городской]. There are also accounts of oppositional media focused on Belarus, such as Belsat TV, Radio Svoboda, and European Radio for Belarus. Among these, Belsat TV and Radio Svoboda are state funded by Poland and the United States, respectively, while European Radio for Belarus is independent. Of these accounts, only that of European Radio for Belarus is “verified” through VK, although all of these groups post content from their respective community pages.

Rank Profile URL # of Videos Discovered Blocked Share Content Posted Account Type
1 https://vk.com/ploshcha 23,265 18.00% Belarus content News Poster
2 https://vk.com/euroradio 18,667 14.44% Verified account of European Radio for Belarus, nonprofit media for Belarus Media
3 https://vk.com/belsat_tv 10,738 8.31% Belsat TV, Polish state funded media for Belarus Media
4 https://vk.com/majabelarus 8,337 6.45% Belarus content Patriotic Community
5 https://vk.com/radiosvaboda 7,354 5.69% Radio Svoboda Belarus, US state media for Belarus Media
6 https://vk.com/patrioty 4,320 3.34% Ukraine war content Patriotic Community
7 https://vk.com/we.patriots 3,681 2.85% Ukraine war content Patriotic Community
8 https://vk.com/1tv_kr_ua 2,564 1.98% Ukrainian Regional Television Media
9 https://vk.com/ua.insider 2,562 1.98% Ukraine war content Nationalist
10 https://vk.com/war_for_independence 2,265 1.75% Ukraine war content Patriotic Community

Table 11: The ten communities which we discovered with the most blocked videos in Russia.

Outside of these top ten communities, there are other communities blocked, including Ukrainian media outlets such as Hromadske [Громадське] and BBC News Ukrainian, and a Belarusian opposition newspaper Nasha Niva [Наша Ніва]. The verified community of the team of Alexei Navalny is also blocked. We also found sport-related communities such as “FC Shakhtar” (a fan page of the Football Club Shakhtar from Donetsk) and By.Tribuna.com (the Belarusian branch of an international sport media Tribuna) among the results.

In addition to the 826 blocked communities which we found via their blocked videos, when we directly searched different article titles within the “Communities” category, we found an additional 804 blocked communities, which we identified by their absence from our search query results. We present the ten queries which led to the discovery of the most blocked communities in Russia in Table 12.

Rank Language Query Translation Types of Communities # of Communities Discovered Blocked Share
1 Russian Неодимовый магнит Neodymium magnet Sale of magnets to tamper with gas and water meters. 72 8.94%
2 Russian Европейская хартия местного самоуправления European Charter of Local Self-Government Pro-USSR regionalist groups 54 6.71%
3 Kazakh Сыпатай Саурықұлы Sypatai Saurykuly Sports wagering communities (query unrelated) 50 6.21%
4 Russian Фиктивный брак Fictitious marriage Communities to arrange fake marriages 42 5.22%
5 Belarusian Пуцін хуйло Putin is a dick Anti-Putin groups 38 4.72%
6 Ukrainian Кирило Лукаріс Kyrylo Loukaris Pill buying/selling (query unrelated) 38 4.72%
7 Ukrainian Національний корпус National Corps Nationalist communities 36 4.47%
8 Russian Партия националистического движения Nationalist Movement Party Nationalist communities 31 3.85%
9 Georgian Путин хуйло Putin is a dick Anti-Putin groups 27 3.35%
10 Chechen СагӀсена Sarcenas Ozempic sales (query unrelated) 28 3.48%

Table 12: The ten queries which we tested which discovered the most blocked communities in Russia.

The query which led to the discovery of the most censored communities is related to the sale of neodymium magnets (“Неодимовый магнит”), accounting for over 8% of the communities which we discovered blocked. The content of these community pages indicates that these are rare earth magnets marketed as being able to tamper with water and gas meters. One group’s description acknowledges that using the magnets for this purpose is prohibited by law, suggesting that the prohibition is not consistently enforced against these communities. Many of the other search queries are also related to potential scams, such as the arranging of fake marriages (“Фиктивный брак”), sports wagering, pill sales, and diet supplements. Blocked communities of racist and nationalist groups are present as well, along with communities related to pro-USSR regionalist groups (e.g., Community of the KNVR of the Udmurt Region [Община КНВР Удмуртского Региона]). Finally, many of the queries and their blocked groups are critical of the government and insulting of Putin, as many are titled with the anti-Putin slogan “Пуцін хуйло,” which translates to “Putin is a dick.”

The blocked communities appear to have a different content focus compared to blocked videos. Whereas blocked video content in Russia is largely related to the Ukraine war and Belarus, blocked communities are focused on potential scams. There is some crossover, however, as racist, nationalist content is blocked in both videos and communities within Russia.

Videos blocked in Canada posted by personal accounts

In contrast to Russia, among the top ten personal accounts that posted the most blocked videos in Canada, all except one primarily posted music content (see Table 13). None of the videos posted by the top ten posters in Canada contained political or current events content. This result is, again, a departure from what was seen in Russia. In Canada, VK thus focuses more on blocking entertainment content, most likely for copyright-related reasons.

Rank Profile URL Account Name Content Posted # of Videos Discovered Blocked Share
1 https://vk.com/ig.linevich Igor Linevich Music 182 25.63%
2 https://vk.com/id474426680 Vadim Popov Music 79 11.13%
3 https://vk.com/chertoritsky Sergey Chertoritsky Music 30 4.23%
4 https://vk.com/walema Stary Ded TV 21 2.96%
5 https://vk.com/step1972 Andrey Krivopishin Music 13 1.83%
6 https://vk.com/blogthe The Blog Music 7 0.99%
7 https://vk.com/sergeylzar Sergey Lazarikhin Music 7 0.99%
8 https://vk.com/s.pantsyrny Slava Pantsyrny Music 6 0.85%
9 https://vk.com/id3788507 Alexander Kukhtin Music 5 0.70%
10 https://vk.com/id243891102 Lasha Ujmachuridze Music 4 0.56%

Table 13: The ten personal accounts which we discovered with the most blocked videos in Canada.

Videos blocked in Canada posted by communities

This trend of blocking entertainment content also holds for communities that posted videos blocked in Canada. Six of the ten blocked community video posters focused on sports, three on music, and one on cartoons. There is a focus on Russian media producer channels as well, including TV (Tele Sport, Okko Sport, and Match Premier) and radio (OMSK 103.9 FM). This content differs from that of blocked community posters in Russia, which also include media outlets but are focused primarily on politics and current events (Belsat TV, Radio Svoboda, and European Radio for Belarus).

Rank Content Poster # of Videos Discovered Blocked Share Content Posted Account
1 https://vk.com/telesport 533 27.57% Sports Russian sports television “Tele Sport”
2 https://vk.com/serieavk 313 16.19% Sports Community for Italian Soccer League “Serie A”
3 https://vk.com/silatv 206 10.66% Sports Russian sports television “Tele Sport”
4 https://vk.com/locasta 161 8.33% Music “Locasta” street dancing clips
5 https://vk.com/okkotennis 119 6.16% Sports Russian TV “Okko Sport” tennis community
6 https://vk.com/okkosport 103 5.33% Sports Russian TV sports station “Okko Sport”
7 https://vk.com/sibiromsk 39 2.02% Music Russian radio station OMSK 103.9 FM
8 https://vk.com/2pac_one_nation 30 1.55% Music Fan community for musician Tupac Shakur
9 https://vk.com/matchpremier 29 1.50% Sports Russian sports television station “Match Premier”
10 https://vk.com/public207473513 26 1.35% Cartoons Community for “Davv Productions”

Table 14: The ten communities which we discovered with the most blocked videos in Canada.

What content is in blocked videos?

Due to the high number of blocked videos which we discovered, it would be infeasible for us to watch and categorize all the content. Instead, to capture the general themes of blocked content, we randomly sampled 30 videos that were blocked in Russia and 30 videos that were blocked in Canada, watched them, and categorized them according to their content.
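As a rough illustration of the sampling step, the following sketch draws 30 random video URLs per country from plain-text files of blocked-video URLs; the file names are hypothetical placeholders, and the fixed seed is only there to make the sketch reproducible.

```python
# Minimal sketch of the random sampling step. Input files are assumed to
# contain one blocked-video URL per line; the file names are hypothetical.
import random

def sample_videos(path: str, k: int = 30, seed: int = 2023) -> list[str]:
    with open(path, encoding="utf-8") as f:
        urls = [line.strip() for line in f if line.strip()]
    rng = random.Random(seed)  # fixed seed only for reproducibility of the sketch
    return rng.sample(urls, k)

for url in sample_videos("blocked_in_russia.txt") + sample_videos("blocked_in_canada.txt"):
    print(url)  # each sampled video was then watched and categorized by hand
```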

Videos blocked in Russia

Among the 30 sampled blocked videos in Russia, we find that the largest share (43%) are videos related to the Ukraine war. The videos reviewed include war footage, demonstrations of military ordnance, interviews with service members, and talk shows discussing the war. The next largest category of blocked content concerns videos related to Belarus (26%), which include videos of protests, as well as news coverage of deaths, detentions, and tragedies. The third most observed category of blocked content is non-war Ukrainian content (13%), which includes news coverage of economic issues and nationalist marches.

Figure 6: Categories of blocked videos in Russia from our randomly selected sample.
Video Missing in Russia Category Notes
https://vk.com/video-36069860_166138550 Belarus Protest around death of Belarussian in pretrial detention
https://vk.com/video-36069860_456240026 Belarus Debate between a Belarusian opposition leader Dashkevich and undercover police
https://vk.com/video155142793_456347462 Belarus Moving Iskander-2 missiles to Belarus
https://vk.com/video-36069860_456252718 Belarus Radio Svoboda coverage of detained Belarussian photographer
https://vk.com/video-36069860_456246093 Belarus TV coverage of the 1999 stampede tragedy in Minsk.
https://vk.com/video-22639447_456254462 Belarus Death of Belarusian scientist Boris Kit
https://vk.com/video-22639447_456264287 Belarus Message from Minsk Workers to Lukashenko’s Trade Union
https://vk.com/video-72572911_456243223 Criminal Ukrainian anti-corruption TV program
https://vk.com/video613313976_456244390 History Educational audio program describing Ukrainian writer and poet Borys Antonenko-Davydovych
https://vk.com/video-23282997_159220433 History Educational video about judging in Middle ages Lithuania and Ukraine
https://vk.com/video-18162618_456243561 Sports Interview with a Shakhtar Donetsk player.
https://vk.com/video-155655277_456239927 Sports Ukrainian first league match between FC Hirnyk-Sport and FC Prykarpattia
https://vk.com/video576554975_456272785 Ukraine (Non-War) Interfax press conference regarding the “Mask-Show-Stop” law in pretrial detention.
https://vk.com/video-24262706_161238424 Ukraine (Non-War) Footage of UPA (Nationalist) march in Kiev
https://vk.com/video155142793_456308212 Ukraine (Non-War) Espreso TV coverage of tax evasion enforcement
https://vk.com/video374267542_456248883 Ukraine (Non-War) Coverage around spending by Speaker of the Verkhovna Rada of Ukraine Andriy Parubiy
https://vk.com/video-11019260_456247195 Ukraine War Ukrainian AF Russian Legion, military
https://vk.com/video155142793_456272568 Ukraine War DShK machine guns in Luhansk region competition, military
https://vk.com/video-93448512_456240035 Ukraine War War footage Ukrainian soldiers inspect destroyed Russian positions
https://vk.com/video715174916_456239961 Ukraine War Political talk show touching topics in Russia and Ukraine
https://vk.com/video155142793_456332156 Ukraine War Commentary about Ukraine and Russia
https://vk.com/video-5063972_456241071 Ukraine War Interview with soldiers in Ukrainian village of Yasinuvata
https://vk.com/video549895_456239793 Ukraine War Commentary about the Ukrainian war
https://vk.com/video62649817_456252177 Ukraine War News coverage about Ukrainian war, Bucha massacre and Kremlin actions
https://vk.com/video-5063972_118384509 Ukraine War Promotional video about Ukrainian marine unit
https://vk.com/video535771132_456240850 Ukraine War Ukrainian security service intercept of battlefield communications.
https://vk.com/video11405356_456239226 Ukraine War A video with a fake “horoscope” that recommends to donate to Ukrainian army
https://vk.com/video-72589198_456240902 Ukraine War Interview with Ukrainian service member.
https://vk.com/video-23502694_456244444 Ukraine War Video of Ukrainian Armed Force tanks
https://vk.com/video-54899733_456240014 USA Biden and Obama at Medal of Honor ceremony.

Table 15: Categories of blocked videos in Russia from our randomly selected sample.

Videos blocked in Canada

We also randomly sampled 30 videos blocked in Canada and categorized their content. In contrast to the categories blocked in Russia, which were largely related to the Ukraine war and Belarus, blocked content in Canada is more related to entertainment, specifically sports (57%), music (40%), and television programming (3%). These categories reflect that the primary motive around blocking in Canada is related to copyright enforcement. There is a complete absence of any political, news, or current events content blocked in Canada, which are categories that dominate the sample of blocked videos in Russia. These findings again indicate that the aim of censorship within Canada is very different from within Russia, with the former being focused on copyright and the latter on news, current events, and politics.

Figure 7: Categories of blocked videos in Canada from our randomly selected sample.
Video Missing in Canada Category Notes
https://vk.com/video-29412860_456240167 Music Radio broadcast
https://vk.com/video2560911_153209689 Music Music video
https://vk.com/video177634113_456239296 Music Music video
https://vk.com/video-41138955_456239155 Music Music video
https://vk.com/video179151037_456245514 Music Music video
https://vk.com/video-175484418_456239085 Music Music video
https://vk.com/video13944339_456240104 Music Music video
https://vk.com/video-116705_456241000 Music Music video
https://vk.com/video5958883_105821112 Music Music video
https://vk.com/video7238152_456244997 Music Music video
https://vk.com/video179151037_456241049 Music Music video
https://vk.com/video-58492936_456239429 Music Music video
https://vk.com/video-151498735_456245443 Sports Soccer
https://vk.com/video-198813611_456240402 Sports Soccer
https://vk.com/video-141682278_456244465 Sports Soccer
https://vk.com/video-198813611_456240223 Sports Soccer
https://vk.com/video-141682278_456249621 Sports Soccer
https://vk.com/video-198813611_456239230 Sports Soccer
https://vk.com/video-141682278_456249560 Sports Soccer
https://vk.com/video-202752058_456239622 Sports Tennis
https://vk.com/video-141682278_456240046 Sports Soccer
https://vk.com/video-141682278_456245917 Sports Soccer
https://vk.com/video-202752058_456239667 Sports Tennis
https://vk.com/video-141682278_456241114 Sports Soccer
https://vk.com/video-141682278_456241024 Sports Soccer
https://vk.com/video-151498735_456247745 Sports Soccer
https://vk.com/video-198813611_456240557 Sports Soccer
https://vk.com/video-198813611_456240867 Sports Soccer
https://vk.com/video-198813611_456239868 Sports Soccer
https://vk.com/video-156580570_456241205 TV Beating Again (순정에 반하다), Season 1, Episode 8

Table 16: Categories of blocked videos in Canada from our randomly selected sample.

Block messages communicated to users

In this section, we review the block messages that are communicated to users when they try to visit blocked content pages in Russia and Canada. We find that all content that is blocked in one region but available in another presents a message to users that explains the reason why the content is unavailable.

We discovered 336 unique messages communicated to users when they try to access blocked content in Russia. All but one of these messages cite a Russian court order as justification for the block. The one message that does not is the more general “This video is unavailable in your country,” which affected five videos. The remaining 335 messages are in Russian; in a similar format, they explain that the video is blocked in the Russian Federation and mention who requested the block, along with the associated case number and date.

Despite there being over three hundred block messages which we discovered, the ten most frequently observed messages account for a large majority (77.15%) of blocked videos. The message that we observed justifying the largest number of blocked videos (33,252 videos or 35%) was requested by the General Prosecutor’s Office, citing case number “27-31-2020/Ид2145-22,” and dated February 24, 2022. Although we were unable to find the text of this court decision, this same case number was cited by the Russian communications regulator, Roskomnadzor, to block 6,037 websites, and, given its timing, we presume that it is related to Russia’s full-scale invasion of Ukraine.

Rank Message Translated Message # of Videos Discovered Blocked Share Cumulative Share
1 Этот материал заблокирован на территории РФ согласно требованию Генеральной прокуратуры Российской Федерации от 24.02.2022 № 27-31-2020/Ид2145-22 This material is blocked on the territory of the Russian Federation in accordance with the request of the General Prosecutor’s Office of the Russian Federation № 27-31-2020/Id2145-22 dated 24.02.2022 33,252 35.02% 35.02%
2 Этот материал заблокирован на территории РФ на основании требования Генеральной прокуратуры Российской Федерации от 12.03.2015 № 27-31-2015/Ид831-15 This material is blocked on the territory of the Russian Federation on the basis of the request of the General Prosecutor’s Office of the Russian Federation from 12.03.2015 № 27-31-2015/Id831-15 11,943 12.58% 47.60%
3 Этот материал заблокирован на территории РФ на основании требования Генеральной прокуратуры РФ от 24.02.2022 № 27-31-2020/Ид2145-22 This material is blocked on the territory of the Russian Federation based on the request of the General Prosecutor’s Office of the Russian Federation № 27-31-2020/Id2145-22 dated 24.02.2022 7,776 8.19% 55.79%
4 Этот материал заблокирован на территории РФ на основании требования Генеральной прокуратуры Российской Федерации от 05.04.2022 № 27-31-2022/Ид4465-22 This material is blocked on the territory of the Russian Federation on the basis of the request of the General Prosecutor’s Office of the Russian Federation № 27-31-2022/Id4465-22 dated 05.04.2022 6,373 6.71% 62.51%
5 Этот материал заблокирован на территории РФ на основании требования Генеральной прокуратуры Российской Федерации от 25.04.2022 № 27-31-2022/Ид5587-22 This material is blocked on the territory of the Russian Federation based on the request of the General Prosecutor’s Office of the Russian Federation № 27-31-2022/Id5587-22 dated 25.04.2022 3,013 3.17% 65.68%
6 Этот материал заблокирован на территории РФ согласно требованию Генеральной прокуратуры Российской Федерации от 27.02.2022 № 27-31-2022/Треб228-22 This material is blocked on the territory of the Russian Federation in accordance with the request of the General Prosecutor’s Office of the Russian Federation № 27-31-2022/Treb228-22 dated 27.02.2022 2,928 3.08% 68.76%
7 Этот материал заблокирован на территории РФ на основании требования Генеральной прокуратуры Российской Федерации от 13.08.2022 № 27-31-2022/Иф-10643-22 This material has been blocked on the territory of the Russian Federation based on the request of the General Prosecutor’s Office of the Russian Federation № 27-31-2022/IF-10643-22 dated 13.08.2022 2,726 2.87% 71.63%
8 Этот материал заблокирован на территории РФ на основании требования Генеральной прокуратуры Российской Федерации от 09.08.2022 № 27-31-2022/Ид11013-22 This material has been blocked on the territory of the Russian Federation based on the request of the Prosecutor General’s Office of the Russian Federation № 27-31-2022/Id11013-22 dated 09.08.2022 2,136 2.25% 73.88%
9 Этот материал заблокирован на территории РФ на основании требования Генеральной прокуратуры РФ № 27-31-2022/Ид13719-22 от 30.09.2022 This material is blocked on the territory of the Russian Federation based on the request of the General Prosecutor’s Office of the Russian Federation № 27-31-2022/Id13719-22 of 30.09.2022 1,645 1.73% 75.62%
10 Этот материал заблокирован на территории РФ на основании требования Генеральной прокуратуры Российской Федерации № 27-31-2022/Треб855-22 от 30.07.2022 This material has been blocked on the territory of the Russian Federation based on the request of the General Prosecutor’s Office of the Russian Federation № 27-31-2022/Treb855-22 of 30.07.2022 1,456 1.53% 77.15%

Table 17: The ten block messages which we discovered to block the most videos in Russia.

The earliest court date mentioned in a block message was March 2, 2014, and the most recent was April 28, 2023, shortly before our testing period ended on May 14, 2023, a range spanning just over nine years. Reviewing the cumulative distribution of the court case dates cited in the messages, we see an uptick in the rate of cited case dates after February 24, 2022 (see Figure 8 and Table 18), which coincides with the day that Russia began its full-scale invasion of Ukraine. Prior to this period, there was a steady and relatively consistent rate of dates mentioned in the justifications. The increased pace slowed beginning in late October or early November 2022, and the slower rate persisted until the end of our test period in May 2023. There is also a gap in which no cases were cited, from December 26, 2022, to January 26, 2023, although this may be explained at least in part by the Eastern Orthodox Christmas holiday season. The reasons for the period of diminished pace and for the gap are unclear. Overall, the timing of these changes suggests that the ongoing conflict has dramatically increased the rate of blocking of video content for Russian users.

Figure 8: Among the 336 block messages citing court cases, the cumulative distribution of the court case dates over time; in red, an increased rate of court orders issued since the February 24, 2022, full-scale invasion of Ukraine; in yellow, the decreased rate beginning late October / early November, 2022 occurring until the end of our measurement period; in green, a gap in observed court orders between December 26, 2022, and January 26, 2023.
Time period Court orders per day Comparison to previous period
March 2, 2014 – February 23, 2022 0.0271
February 24, 2022 – October 31, 2022 0.826 Rate increased by factor of 30.5
November 1, 2022 – April 28, 2023 0.200 Rate decreased by factor of 4.14

Table 18: Comparison of rate of court orders during three time periods.
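The rates in Table 18 can be approximated from the block messages themselves. The sketch below is illustrative rather than the authors’ code: it pulls dates such as “24.02.2022” out of each message with a regular expression and computes court orders per day within each period. It counts dated messages rather than deduplicated case numbers, so exact figures may differ slightly; the input file name is a hypothetical placeholder.

```python
# Minimal sketch of the per-period rate calculation behind Table 18.
# The input file (one block message per line) is a hypothetical placeholder.
import re
from datetime import date

DATE_RE = re.compile(r"\b(\d{2})\.(\d{2})\.(\d{4})\b")  # e.g. 24.02.2022

def extract_date(message: str) -> date | None:
    m = DATE_RE.search(message)
    if not m:
        return None
    day, month, year = (int(g) for g in m.groups())
    return date(year, month, day)

with open("block_messages_russia.txt", encoding="utf-8") as f:
    dates = [d for line in f if (d := extract_date(line)) is not None]

periods = [
    (date(2014, 3, 2), date(2022, 2, 23)),
    (date(2022, 2, 24), date(2022, 10, 31)),
    (date(2022, 11, 1), date(2023, 4, 28)),
]
for start, end in periods:
    n = sum(start <= d <= end for d in dates)
    days = (end - start).days + 1
    print(f"{start} to {end}: {n / days:.3f} court orders per day")
```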

In contrast, among the videos blocked in Canada, no messages returned to users cite any legal justification for the blocked content. The only two block messages which we observed justifying blocked videos in Canada are the more general “This video is unavailable in your country” (87.56%) and “Video sound unavailable” (12.44%). It is a common practice in social media moderation to restrict sound when that sound contains copyrighted music. These messages in Canada stand in stark contrast to the messages seen in Russia, which are more varied and overwhelmingly cite a court order.

Message # of Videos Discovered Blocked Share
This video is unavailable in your country 2,288 87.56%
Video sound unavailable. 325 12.44%

Table 19: The two block messages which we discovered justifying blocked videos in Canada.

Limitations

In this section, we discuss some of the limitations of our methodology. First, our methods only uncover differential censorship (i.e., censorship which is present in one region but not another). Our methods cannot uncover censorship which VK applies to all regions or countries of the world. It is likely that this report undercounts censorship and other forms of moderation carried out on the platform, as we have no visibility into deletions of content that would apply to all regions.

Figure 9: Block message for the account of Yevgeny Prigozhin, blocked in both Canada and Russia.

To illustrate this limitation, at the time of this writing, we are aware of at least seven instances of Russian court-ordered takedowns being applied outside of Russia. First is the account of Yevgeny Prigozhin, which, when we browsed it on June 26, 2023, from Canada, Ukraine, or Russia, displayed a block message citing a court order dated June 24, 2023 (see Figure 9). On June 24, 2023, Prigozhin, the founder and leader of the Wagner mercenary group, led a mutiny and marched toward Moscow, which ended abruptly when Prigozhin agreed to leave Russia for Belarus. We found six other blocked accounts displaying this block message that are also related to the Wagner Group.

It is unclear why VK blocked these Wagner Group-associated pages in Canada. In the block message, there is no explanation of these accounts violating any VK terms of service or safety guidelines. The only justification given is a Russian court order and a request from Roskomnadzor, which should only apply to users based in Russia. While pages related to the Wagner Group are the only examples of Russian court-ordered blocking being applied to users broadly outside of Russia that we are aware of, there may exist other instances of blocking which we have not discovered.

A second limitation of our work is that we did not perform testing from accounts which were signed in. As a consequence, we were neither able to receive search results for nor view videos which the poster of the video configured to only be visible to signed-in users. However, we do not believe this limitation to influence the direction of our findings in any meaningful way.

Another limitation of our work is that our methodology limited us to finding missing results only for search queries with fewer than 1,000 results. This limitation does not strictly mean that we cannot detect blocked content when it appears in the results of a query with at least 1,000 results, but it does mean that we can only detect such content by its absence from a narrower query. We believe that our large query sample size ameliorates this limitation, and we do not believe it skews the direction of our findings in any meaningful way.
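To make the comparison concrete, the sketch below shows the core of the differential check under the 1,000-result constraint: a video ID that appears in the enumerable result set of one region but not another is flagged as a candidate for region-specific blocking. The function and its inputs are hypothetical illustrations, not part of the authors’ published tooling.

```python
# Minimal sketch of the differential-result comparison described above.
def missing_in_region(
    results_reference: set[str],  # e.g., video IDs returned for a query in Canada
    results_tested: set[str],     # e.g., video IDs returned for the same query in Russia
    total_reference: int,         # total result count reported by the platform
) -> set[str]:
    if total_reference >= 1000:
        # The reference result set was truncated, so absence from the tested
        # region is not meaningful; the content must instead be caught by a
        # narrower query whose full result set can be enumerated.
        return set()
    return results_reference - results_tested

# Toy example:
print(missing_in_region({"v1", "v2", "v3"}, {"v1", "v3"}, total_reference=3))
# -> {'v2'}
```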

Finally, as we tested the titles of the most popular articles on multiple language editions of Wikipedia, our methods are biased toward finding blocked videos, communities, and people related to popular topics on Wikipedia in these language editions. As an example, topics on Russia’s invasion of Ukraine were popular in the Ukrainian language Wikipedia during January, February, and March 2023. As we tested the titles of the most popular articles on Ukrainian Wikipedia during this period, we incidentally tested a large number of queries related to the Ukraine war. While it is possible that this topic is popular on VK for the same reasons it is popular on the Ukrainian language Wikipedia, it is also possible that we are oversampling such videos on VK due to our large number of test queries related to this topic.

Discussion

In this section, we conclude by discussing how our findings contribute to a greater understanding of Russian social media censorship in Russia and how it compares to censorship abroad. Finally, we compare the Russian approach to social media censorship to the Chinese model of social media censorship.

Broad keyword-based blocking of LGBTIQ content

While much of the analysis we performed was on blocked videos, communities, and personal accounts, we also discovered that searches for communities and personal accounts in Russia were censored when their search queries contained keywords related to LGBTIQ content (see Table 2). We found that the use of keyword-based filtering applied exclusively to LGBTIQ terms within Russia and that it is not active in Canada or Ukraine. Moreover, it is unclear why this filtering is only applied to searches for communities and personal accounts, but not for videos. To underscore how these terms were not being censored as part of an “adult only” or safe-search filter but only being used for LGBTIQ filtering, we additionally tested the following search queries:

  • pornography
  • порнография
  • porn
  • порно
  • sex
  • секс
  • fuck
  • ебать
  • блять
  • трахаться
  • трахать
  • anal
  • анальный
  • bitch
  • сука
  • pussy
  • пизда

As none of the terms above triggered keyword-based censorship of our search queries, we can conclude that the LGBTIQ-based keyword censorship is not part of a larger safe-search feature but rather one meant to target solely LGBTIQ-related search queries.
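The logic of this control-term comparison can be expressed compactly: a term is treated as keyword-filtered when community (or people) searches for it return nothing at all in Russia while returning results in Canada. The sketch below is purely illustrative; the example counts are invented for demonstration and are not measurements from this study.

```python
# Illustrative sketch of the control-term comparison; counts are invented.
def appears_keyword_filtered(count_russia: int, count_canada: int) -> bool:
    # Keyword-based filtering is suspected when a query returns zero results
    # in Russia while the same query returns results in Canada.
    return count_russia == 0 and count_canada > 0

example_counts = {
    # term: (community search results in Russia, in Canada); hypothetical numbers
    "порно": (120, 118),
    "ЛГБТ": (0, 85),
}
for term, (ru, ca) in example_counts.items():
    verdict = "keyword-filtered" if appears_keyword_filtered(ru, ca) else "not filtered"
    print(f"{term}: {verdict}")
```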

It is unclear why keyword-based filtering is only used to censor LGBTIQ search queries and not queries for content critical of Putin, the invasion of Ukraine, or other content found blocked elsewhere on VK. Keyword-based blocking is a particularly blunt tool. On one hand, it is overly broad, capturing content that may not have been intended. For example, we found that many anti-LGBTIQ groups existed on VK, and thus the blocking of LGBTIQ-related searches prevented users from discovering pro- and anti-LGBTIQ groups alike. On the other hand, keyword-based blocking is simultaneously narrow. As one example, we found that “LGBT” and “LGBTQ” were blocked but not other variants such as “LGBTQIA”. As another example, although “gay” was censored, “gays” was not. Some terms were blocked in both Cyrillic and Roman characters (e.g., “Геи” and “gay”), while others were blocked only in Cyrillic (e.g., “Фембой” but not “femboy”). These inconsistencies give the impression that the list of blocked terms used by VK was arbitrarily created. Finally, as keyword-based filtering only applies to searches, users can still access communities and personal accounts whose names contain blocked keywords by searching for other keywords in their names or by typing the URLs to their pages directly.

Given that keyword-based blocking is simultaneously both too broad and too narrow, as well as ineffective, it is unclear why it is applied only to LGBTIQ content, much less any content at all. One possibility is that, because the anti-LGBT “propaganda” laws (including the federal law “for the Purpose of Protecting Children from Information Advocating a Denial of Traditional Family Values”) are vague concerning what constitutes “LGBT propaganda,” this type of filtering is intended to be very visible to users, although it is not actually effective at censoring content. In this sense, this filtering may be acting as a sort of “compliance checkbox” to attempt to demonstrate compliance with Russian law.

We found that VK attributed every blocked community or person in Russia to a court decision and that every blocked video in Russia was attributable to a blocked community or person. Altogether, there were 336 different VK block messages that cited 303 unique legal case numbers. In some instances, we were able to find the text of the court decision ordering the blocking of the communities or personal accounts and retrieve the law cited to justify the ordering of the blocking. More study is needed to systematically analyze the court cases and laws justifying VK’s blocking decisions and to determine both whether VK cites appropriate court decisions to justify its blocking and whether those court decisions cite appropriate laws to justify their blocking orders. It seems that, in many cases, the necessary information may be available to perform such an analysis. At this time, we will merely call attention to one block message which is notable because a press release was also cited:

“Этот материал заблокирован на территории РФ на основании решения суда/уполномоченного федерального органа исполнительной власти (Металлургический районный суд г. Челябинска – Челябинская область) от 11.12.2019 № 2а-3052/2019 Комментарий ВКонтакте: vk.com/press/blocking-public38905640”

In English:

[This material was blocked on the territory of the Russian Federation on the basis of the decision of the court / authorized federal executive body (Metallurgichesky District Court of Chelyabinsk – Chelyabinsk Region) dated December 11, 2019 No. 2a-3052/2019 Comment VKontakte: vk.com/press/blocking-public38905640]

In the linked March 2021 press release, VK notes Russia’s increasingly tightening regulations of social media networks and legal obligations to implement proactive censorship measures in justifying the blocking of the “Альянс гетеросексуалов и ЛГБТ за равноправие” [Alliance of Heterosexuals and LGBT for Equality] VK community.

Gaps in blocking transparency

While VK consistently attributed blocking in Russia to court orders, VK’s approach of blocking users, and then transitively all of their videos, rather than blocking specific videos themselves, still lacks transparency on multiple levels. Although VK consistently provides a legal justification for why a community or personal account is blocked in Russia, when viewing a blocked video it is not clear who the poster was, and, even if the blocked poster is known, it is not clear to other VK users which video or other content from that user may be responsible for their blocking. This problem is exacerbated as VK’s blocking has the effect of capturing all past and future posted videos of the blocked community or personal account. Thus, VK’s approach has a tendency to over-block, as a community or personal account may have multiple interests and post content on a variety of topics, including benign ones that are unrelated to the original justification of a block. Reviewing some of the court orders which VK cited in justifying account blocking, we found that the orders had no associated time period. Thus, the blocks may be applied in perpetuity, exacerbating this over-blocking. Further, it is not clear if VK notifies a poster that their content is being blocked in Russia. Thus, VK users may be unaware that all of their content is unavailable to users in Russia, especially if they are using VK from a region other than Russia.

We found that copyrighted entertainment content, including TV, sports, and music, was often blocked in Canada, while current events content, mainly dealing with the Ukraine war and Belarus, was blocked the most in Russia. Copyrighted content was thus largely accessible in Russia even when it was blocked in Canada. Although in this report we did not systematically compare Ukraine and Canada for differential blocking, we generally observed that the same copyrighted content unavailable in Canada was accessible in Ukraine. This observation suggests that VK approaches copyright moderation on a geographical basis, rather than using a method which distinguishes Russia from all other countries. Based on our analysis, VK’s approach to copyright moderation is far more lax and permissive in Russia and Ukraine than in Canada. That is, VK users in Canada have more content restricted on a copyright justification than users in Russia and Ukraine. Despite this uneven application of copyright enforcement, we also found that pirated content is widespread on the platform, especially ebooks and music, which are widely available on VK.

Figure 10: An example of pirated ebook content available and downloadable on the “English Books and Magazines” VK community.

This differential treatment of users by region is also revealed in other ways, such as in VK’s privacy policy, which has different data retention policies for Russian users versus users outside of Russia. For example, according to those policies, VK “store[s] Russian users’ messages for six months and other data for a year (in accordance with paragraph 3, Article 10.1 of Federal Law ‘On Information, Information Technologies, and Information Protection’).”

Comparison to Chinese social media censorship

China’s social media information control system is decentralized and characterized by “intermediary liability,” or what China refers to as “self-discipline,” allowing the Chinese government to push responsibility for information control onto the private sector. Internet operators deemed to have failed to adequately implement information controls are liable to receive fines, have their business licenses revoked, or face other adverse actions. These companies are largely left to decide on their own what to proactively censor on their platforms, attempting to balance the expectations of their users against appeasing the Chinese government. In China, block messages are often not displayed by Chinese platforms, and therefore users have no way of knowing the legal justification for the blocked content. In Russia, by contrast, VK ultimately attributed the blocking of each video, community, or person to whichever court case ordered the blocking of that content. In some cases, we were able to find the text of the court case and retrieve the laws cited in justifying the takedown request. While much may be lacking in terms of due process in Russia’s court-ordered blocking approach, this system is still more transparent than in China, where blocking is carried out more proactively by the private sector and blocking decisions are left largely to the discretion of Internet operators.

Chinese social media companies have struggled to grow their platforms globally and to apply information controls while they expand. Tencent’s WeChat has been scrutinized for its application of Chinese political censorship and surveillance, either expressly or secretly, to conversations among users entirely registered outside of China. Furthermore, when using WeChat, users have no visibility into whether they are communicating with a user registered in China and therefore cannot predict the extent to which their communications will be subject to political censorship or surveillance. Unlike Tencent, Bytedance simply abandoned the idea of growing a single platform with radically different information control rules for users inside versus outside of China. Instead, Bytedance operates Douyin inside China and a platform with a completely distinct user base, TikTok, outside of China. VK’s approach of blocking community and user accounts, but not content directly, may have some advantage in alleviating the friction in attempting to expand VK globally, or outside of the Russian information control regime. On VK, users in Russia are simply unable to communicate or read the content of users blocked in Russia, and thus there have not been negative media stories covering how non-Russia-based users are having their content deleted in the style of those covering WeChat. This difference is because, on VK, politically motivated blocking is seemingly applied only to users and not individual content.

At a high level, there are both similarities and differences in the topics censored in Russia and China. In both countries, foreign news sources and criticism of their top leaders are subject to censorship. However, each country also has its particular sensitivities. For instance, while Chinese social media has not always been friendly to LGBTIQ content, in Russia such content is aggressively targeted, as facilitated by the anti-LGBTIQ “propaganda” laws. In light of its invasion of Ukraine, Russia is also particularly sensitive to content that is critical of the Russian side of the armed conflict. Conversely, some of China’s evergreen political sensitivities include the Falun Gong spiritual/political movement, the status of Taiwan, and calls for independence of Tibet, Xinjiang, and Hong Kong. While Chinese social media has also been quick to censor content related to the COVID-19 pandemic, we did not find differential censorship relating to COVID-19 on VK, though this might be because such content was removed in all regions that we analyzed.

While both China and Russia use Internet censorship to protect the political images of their own leaders, they are inconsistent in how they protect the images of each other’s leaders. Although Chinese Internet platforms appear willing to help protect the image of Putin, we found no evidence of VK blocking content critical of Xi Jinping or any other Chinese leader. In our ongoing study of censorship on Chinese search platforms, we have found that Chinese search engines Baidu and Sogou and video sharing site Bilibili enforce censorship rules relating to “普京” [Putin]. As examples, we found that search queries on Sogou containing “普京 + 独裁” [Putin + dictatorship], “普京 + 希特勒” [Putin + Hitler], or “普京窃国” [Putin’s kleptocracy] restricted search results to only Chinese state media websites and other Beijing-aligned sources. While some censorship rules seem solely focused on protecting Putin’s image, others may reveal China’s less-than-altruistic motivations in doing so. For instance, “普京亲信兵变 + 震动中南海” [mutiny of Putin’s cronies + shaking in the Chinese Communist Party’s headquarters] and “台湾 + 成为下一个乌克兰” [Taiwan + becoming the next Ukraine] reveal China’s insecurities concerning how Prigozhin’s mutiny may be predictive of the future stability of the Chinese Communist Party’s own regime and how Russia’s unanticipated difficulties invading Ukraine may be prognostic of any future realization of China’s own ambitions to invade Taiwan. More generally, Chinese censors may be motivated to protect Putin’s image not only because Russia is an ally of China but also because of the similarities in, and therefore common insecurities born from, their methods of governance. Regardless of Chinese censors’ motivations here, we found no evidence that Russia’s VK reciprocated the favor by helping to protect China’s leaders from criticism on VK.

Finally, while there are theories that the Internet is “balkanizing” or becoming a “splinternet,” wherein different countries or regions slowly form their own isolated networks over time, examples of social media censorship from both China and Russia show that the borders of these isolated networks may be fairly permissive, but only in one direction. On WeChat, users with China-registered accounts are subject to the platform’s invasive political censorship, whereas users in other countries can not only access WeChat but also express political ideas with one another with relative freedom compared to their Chinese counterparts. We find the same with VK: VK subjects users in Russia to pervasive political censorship, whereas users in other countries are not only allowed membership on the site but are also relatively more free to engage in political speech. Ironically, each of these social media networks imposes the greatest restrictions on users in the country where the network was founded, while not only allowing users from other countries to join but also granting them the freedom to engage in a wider range of political expression.

Data

The complete set of videos, communities, and people that we found blocked in Russia and Canada, as well as their block messages, are available on GitHub at the following link: https://github.com/citizenlab/not-ok-on-vk-data

Acknowledgments

We would like to thank Michelle Akim, Siena Anstis, Pellaeon Lin, Irene Poetranto, Adam Senft, and Andrei Soldatov for valuable editing and peer review. Research for this project was supervised by Ron Deibert.

Appendix A: Accessibility of VK in Ukraine

In 2017, a presidential decree issued by the Petro Poroshenko administration ordered VK and other Russian social media platforms to be blocked by Ukrainian network providers. This order was extended in 2020 by the Zelensky administration until 2023. To contextualize the findings in this report with their real-world effect on Ukrainian users, we reviewed recent data measuring the accessibility of VK in Ukraine. Namely, we reviewed relevant data collected by the Open Observatory of Network Interference (OONI), a non-profit organization that collects global data on website accessibility. Measurements of website accessibility are performed by volunteers who run software (called OONI Probe) which attempts to access a list of websites, including VK, reporting the results to a centralized database. We reviewed this database of measurements, specifically all attempts to access any site under the “vk.com” domain space in Ukraine from May 20, 2023, to June 20, 2023. This review covered a total of 295 measurements from 20 networks in Ukraine. We find that VK remained consistently blocked in Ukraine during this period on all but three networks. We also find that, on six networks, the blocks are likely easy to circumvent based on how the blocking is carried out.
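Comparable data can be retrieved from OONI’s public measurement API. The sketch below shows one way to query it; the endpoint, parameter names, and response fields reflect our reading of OONI’s public API documentation and should be verified against the current documentation before use.

```python
# Minimal sketch of pulling OONI web connectivity measurements for vk.com
# collected in Ukraine during the review window. Endpoint, parameter, and
# field names should be checked against OONI's current API documentation.
import requests

resp = requests.get(
    "https://api.ooni.io/api/v1/measurements",
    params={
        "domain": "vk.com",
        "probe_cc": "UA",              # measurements from Ukraine
        "test_name": "web_connectivity",
        "since": "2023-05-20",
        "until": "2023-06-20",
        "limit": 1000,
    },
    timeout=60,
)
resp.raise_for_status()
for m in resp.json().get("results", []):
    print(m.get("probe_asn"), m.get("anomaly"), m.get("measurement_url"))
```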

ASN Network Name # of Measurements VK Accessible? Blocking Transparent? Blocking Type Blocking Method
AS13188 TRIOLAN 49 No Yes Block Page With Legal Justification Incorrect DNS Server Response
AS24685 DOMONET 43 No No IP Blocking Timeouts
AS25386 INTERTELECOM-AS 40 No Sometimes IP Blocking and Blockpage Timeouts + Block Page, No DNS tampering
AS15895 Kyivstar PJSC 35 No Yes Block Page With Legal Justification Incorrect DNS Server Response
AS6849 UKRTELNET 31 No No Localhost DNS Response Incorrect DNS Server Response
AS30886 KOMITEX-AS 25 No No Localhost DNS Response Injection of DNS Response
AS25521 ASN-ASIPN 16 No Yes Block Page With Legal Justification Incorrect DNS Server Response
AS12963 VOLZ Scientific -Industrial Firm Volz Ltd 12 No No IP Blocking Timeouts
AS3326 DATAGROUP Datagroup PJSC 12 No No IP Blocking Timeouts
AS25482 ISP-STATUS ISP STATUS 10 No No HTTP Blocking HTTP level blocking but TCP connection succeeds
AS200000 UKRAINE-AS 8 Yes No Blocking No Blocking No Blocking, but timed out 1/8 times.
AS197058 ASPSTS 4 No No IP Blocking Host unreachable
AS21497 UMC-AS 2 No Yes Block Page With Legal Justification Incorrect DNS Server Response
AS44477 STARK-INDUSTRIES 2 No No IP Blocking Timeouts
AS14593 SPACEX-STARLINK 1 Yes No Blocking No Blocking No Blocking
AS29436 ASN-IMPERIAL 1 No No IP Blocking Timeouts
AS56835 UTELS 1 No No IP Blocking Timeouts
AS56851 VPS-UA-AS 1 No No IP Blocking Timeouts
AS57033 ALTAIR-KPK-AS 1 No No IP Blocking Timeouts
AS196767 INMART1-AS 1 Yes No Blocking No Blocking No Blocking

Table 20: Summary of VK availability in Ukraine during May 20 to June 20, 2023, according to OONI measurements.

We find that VK is blocked in Ukraine using a variety of methods depending on the network. This variation indicates that censorship of VK is likely implemented at the ISP level rather than by a government-run national filtering system. We see three networks where VK remains available: UKRAINE-AS (AS200000), INMART1-AS (AS196767), and the satellite network provider SpaceX Starlink (AS14593). However, on the network UKRAINE-AS, although VK was accessible in seven measurements, the connection timed out in one measurement. On the remaining 17 networks, VK was blocked, though the method varied. The networks mainly either blocked VK by IP (9 out of 20 networks) or returned incorrect IP responses from their DNS servers (5 out of 20 networks). One network, INTERTELECOM-AS (AS25386), both blocked VK by IP and provided an incorrect DNS server response that returned a block page. One network (KOMITEX-AS, AS30886) injected incorrect DNS responses, and another, ISP-STATUS (AS25482), blocked VK at the HTTP level.

It is important to note that blocking implemented solely by returning an incorrect DNS server response, as is the case on five networks, should be easy for knowledgeable users to circumvent. Simply changing the DNS server from the ISP-hosted default to a public DNS server provided by Quad9 or Cloudflare may be sufficient to circumvent this blocking. Furthermore, some systems may already be preconfigured to use a DNS server not provided by the user’s ISP. Firefox, for instance, uses DNS over HTTPS (DoH) by default in multiple countries, including Ukraine, automatically circumventing DNS-based blocking. The fact that some networks implement blocking that is so easily, and perhaps even accidentally, evaded may help explain why Ukraine still has the third-largest number of visitors to VK despite the drop in visits.
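A simple way to check whether a network’s blocking is DNS-based, and whether a resolver change would bypass it, is to compare the answer from the default resolver with the answer from a public resolver. The sketch below uses the dnspython library (2.x API) and is a diagnostic illustration, not the measurement code used in this report.

```python
# Minimal sketch comparing A records for vk.com from the system's default
# resolver and from a public resolver. A default answer that points at
# localhost or a block page, or that differs sharply from the public answer,
# is consistent with DNS-based blocking that changing resolvers would bypass.
import dns.resolver  # pip install dnspython

def resolve_a(hostname: str, nameserver: str | None = None) -> set[str]:
    resolver = dns.resolver.Resolver()
    if nameserver is not None:
        resolver.nameservers = [nameserver]
    return {rr.to_text() for rr in resolver.resolve(hostname, "A")}

default_answer = resolve_a("vk.com")            # ISP-provided resolver
public_answer = resolve_a("vk.com", "9.9.9.9")  # Quad9 public resolver

print("default resolver:", sorted(default_answer))
print("public resolver: ", sorted(public_answer))
print("answers differ:  ", default_answer != public_answer)
```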

Only four networks communicated the block transparently to users all of the time by displaying a block page: Triolan (AS13188), Kyivstar (AS15895), ASIPN (AS25521), which is known more widely as IPnet.ua, and UMC (AS21497). All four networks are large residential ISPs in the country. For example, a sample measurement on the Triolan network shows that an attempt to access “https://vk.com” leads to an SSL error or a block page. This page reads in part: “WARNING! Access to the resource cannot be granted! Access to this resource is not granted in order to fulfill the Decrees of the President of Ukraine,” and it cites the relevant legal decrees.

One network, Intertelecom (AS25386), communicated the block transparently only some (42.5%) of the time. The remaining 12 networks did not transparently return a block message to the user. For users on these networks, attempts to access VK resulted in these requests failing, which is similar to other network errors, and without providing a legal justification. Therefore, we found the blocking of VK in Ukraine to be highly variable. Some networks perform no blocking, and among those that do, users on those networks may not experience blocking depending on their DNS configuration. Finally, for those in Ukraine whose access to VK is blocked, they may receive a block message or a network or SSL error.

Figure 11: An example of a transparent block page returned on the network “Triolan” [AS13188] in Ukraine for vk.com on June 18, 2023.
]]>
Missing Links: A comparison of search censorship in China https://citizenlab.ca/2023/04/a-comparison-of-search-censorship-in-china/ Wed, 26 Apr 2023 14:00:08 +0000 https://citizenlab.ca/?p=79316 This report has an accompanying FAQ.

Key findings

  • Across eight China-accessible search platforms analyzed — Baidu, Baidu Zhidao, Bilibili, Microsoft Bing, Douyin, Jingdong, Sogou, and Weibo — we discovered over 60,000 unique censorship rules used to partially or totally censor search results returned on these platforms.
  • We investigated different levels of censorship affecting each platform, which might either totally block all results or selectively allow some through, and we applied novel methods to unambiguously and exactly determine the rules triggering each of these types of censorship across all platforms.
  • Among web search engines Microsoft Bing and Baidu, Bing’s chief competitor in China, we found that, although Baidu has more censorship rules than Bing, Bing’s political censorship rules were broader and affected more search results than Baidu’s. On average, Bing also restricted results from a greater number of website domains.
  • These findings call into question the ability of non-Chinese technology companies to better resist censorship demands than their Chinese counterparts and serve as a dismal forecast concerning the ability of other non-Chinese technology companies to introduce search products or other services in China without integrating at least as many restrictions on political and religious expression as their Chinese competitors.

Introduction

Search engines are the information gatekeepers of the Internet. As such, search platform operators have a responsibility to ensure that their services provide impartial results. However, in this report, we show how search platforms operating in China infringe on their users’ rights to freely access political and religious content by implementing rules that either block all results for a search query or selectively show results only from certain sources, depending on the presence of triggering content in the query.

In this work we analyze a total of eight different search platforms. Three of the search platforms are web search engines: two operated by Chinese companies, Baidu and Sogou, and one operated by a North American company, Microsoft Bing, whose level of censorship we found in many ways to exceed that of its Chinese counterparts. While China’s national firewall blocks access to websites, the role that Baidu, Microsoft, and Sogou play in controlling information is in overcoming two of the firewall’s limitations. First, due to the increasingly ubiquitous use of HTTPS encryption, China’s firewall can typically only choose to censor or not censor entire sites as a whole. However, these search engine operators overcome this limitation by selectively censoring sites depending on the type of information that the user is querying. Second, China’s firewall operates opaquely, displaying a connection error of some kind in a user’s web browser. By hiding the very existence of sites containing certain political and religious content, Baidu, Microsoft, and Sogou aid in preventing the user from being informed that they are being subjected to censorship in the first place.

We also examine search censorship on Chinese social media companies, namely Baidu Zhidao, Bilibili, Douyin, and Weibo. Perhaps more familiar to non-Chinese audiences are Douyin and Weibo. Douyin, developed and operated by TikTok’s ByteDance, is the version of TikTok operating in China, and Weibo is a microblogging platform similar to Twitter. Perhaps less known are Baidu Zhidao and Bilibili. Baidu Zhidao is a question and answer platform similar to Quora operated by the same company as the Baidu search engine, and Bilibili is a video sharing site similar to YouTube. We also look at e-commerce platform Jingdong, which is similar to Amazon.

Given the strict regulatory environment which they face, users in China have limited choice in how they search for information. However, even among those limited choices, we nevertheless found important differences in the levels of censorship and in the availability of information among these search platforms. Most strikingly, we found that, although Baidu, Microsoft’s chief search engine competitor in China, has more censorship rules than Bing, Bing’s political censorship rules were broader and affected more search results than Baidu’s. This finding runs counter to the intuition that North American companies infringe less on their Chinese users’ human rights than their Chinese counterparts.

The remainder of this report is structured as follows. In “Background” and “Related work”, we summarize the legal and regulatory environment in which Internet companies in China operate as well as existing research on Chinese search censorship. In “Model”, “Methodology”, and “Experimental setup”, we describe how we model censorship rules, the manner in which we discover each platform’s censorship rules, and the conditions in which we executed our experiments. In “Results”, we reveal our findings of over 60,000 unique censorship rules being discovered, and we attempt to characterize which platforms censor more of what kinds of material. Finally, in “Limitations” and “Discussion”, we discuss the limitations of our study, what our findings say about non-Chinese companies entering the Chinese market, and implications for future research.

Background

Internet companies operating in China are required to comply with both government laws concerning content regulations as well as broader political guidelines not codified in the law. Multiple actors within the government – including the Cyberspace Administration of China and the Ministry of Public Security – hold companies responsible for content on their platforms, either through monitoring platforms for violations or investigating online criminal activity. Companies are expected to dedicate resources to ensure that all content is within legal and political or ideological compliance, and they can be fined or have their business licenses revoked if they are believed to be inadequately controlling content. China’s information control system is characteristically one of intermediary liability or “self-discipline”, which allows the government to push responsibility for information control to the private sector.

To understand the kind of information which is expected to be censored by companies in China, we can lean on at least four kinds of sources: (1) state legislation and regulations, (2) official announcements about state-led internet clean-up campaigns, (3) government-run online platforms where users can report prohibited material, and (4) official announcements about what kinds of prohibited material has been reported to the authorities.

Chinese government legislation and regulations have included provisions specifying what kinds of online content are prohibited. These documents include the Measures for the Administration of Security Protection of Computer Information Networks with International Interconnections (1997), the Cybersecurity Law (2017), Norms for the Administration of Online Short Video Platforms and Detailed Implementation Rules for Online Short Video Content Review Standards (2019), and Provisions on the Governance of the Online Information Content Ecosystem (2020). Many of the categories of prohibited content are shared among these four documents, as indicated in Figure 1. Shared categories include pornography and attacks on China’s political system. However, it is also clear that more recent documents – in particular, the 2019 Norms for the Administration of Online Short Video Platforms and the 2020 Provisions on Ecological Governance of Network Information Content – have provided new categories of prohibited content. These include specific prohibitions against “harming the image of revolutionary leaders or heroes and martyrs” [损害革命领袖、英雄烈士形象] and more vague prohibitions against material which promotes “indecency, vulgarity, and kitsch” [低俗、庸俗、媚俗].

Measures for the Administration of Security Protection of Computer Information Networks with International Interconnections (1997)

Article 5:

No unit or individual may use the Internet to create, replicate, retrieve, or transmit the following kinds of information:

  1. Inciting to resist or breaking the Constitution or laws or the implementation of administrative regulations
  2. Inciting to overthrow the government or the socialist system
  3. Inciting division of the country, harming national unification
  4. Inciting hatred or discrimination among nationalities or harming the unity of the nationalities
  5. Making falsehoods or distorting the truth, spreading rumors, destroying the order of society
  6. Promoting feudal superstitions, sexually suggestive material, gambling, violence, murder
  7. Terrorism or inciting others to criminal activity; openly insulting other people or distorting the truth to slander people
  8. Injuring the reputation of state organs
  9. Other activities against the Constitution, laws or administrative regulations

Cybersecurity Law (2017)

Article 12:

Any person and organization using networks shall abide by the Constitution and laws, observe public order, and respect social morality; they must not endanger cybersecurity, and must not use the Internet to engage in activities endangering national security, national honor, and national interests; they must not incite subversion of national sovereignty, overturn the socialist system, incite separatism, break national unity, advocate terrorism or extremism, advocate ethnic hatred and ethnic discrimination, disseminate violent, obscene, or sexual information, create or disseminate false information to disrupt the economic or social order, or information that infringes on the reputation, privacy, intellectual property or other lawful rights and interests of others, and other such acts.

Norms for the Administration of Online Short Video Platforms and Detailed Implementation Rules for Online Short Video Content Review Standards (2019)

4. Technical Management Regulations:

Based on the basic standards for review of online short video content, short video programs broadcast online, as well as their titles, names, comments, danmu (bullet comments), emojis, and language, performance, subtitles, and backgrounds, must not have the following specific content appear (commonly seen problems):

  1. Content attacking the national political system or legal system
  2. Content dividing the nation
  3. Content harming the nation’s image
  4. Content harming the image of revolutionary leaders or heroes and martyrs
  5. Content disclosing state secrets
  6. Content undermining social stability
  7. Content harmful to ethnic and territorial unity
  8. Content counter to state religious policies
  9. Content spreading terrorism
  10. Content distorting or belittling exceptional traditional ethnic culture
  11. Content maliciously damaging or harming the image of the state’s civil servants such as from people’s military, state security, police, administration, or justice, or the image of Communist Party members
  12. Content glamorizing negativity or negative characters
  13. Content promoting feudal superstitions contrary to the scientific spirit
  14. Content promoting a negative and decadent outlook on life or world view and values
  15. Content depicting violence and gore, or showing of repulsive conduct and horror scenes
  16. Content showing pornography and obscenity, depicting crass and vulgar tastes, or promoting unhealthy and non-mainstream attitudes towards love and marriage
  17. Content insulting, defaming, belittling, or caricaturing others
  18. Content in defiance of social mores
  19. Content that is not conducive to the healthy growth of minors
  20. Content promoting or glamorizing historical wars of aggression or colonial history
  21. Other content that violates relevant national provisions or social mores and norms

Provisions on the Governance of the Online Information Content Ecosystem (2020)

Article 6:

A network information content producer shall not make, copy or publish any illegal information containing the following:

  1. Violating the fundamental principles set forth in the Constitution
  2. Jeopardizing national security, divulging state secrets, subverting the state power, or undermining the national unity
  3. Damaging the reputation or interests of the state
  4. Distorting, defaming, desecrating, or denying the deeds and spirit of heroes and martyrs, and insulting, defaming, or otherwise infringing upon the name, portrait, reputation, or honor of a hero or a martyr
  5. Advocating terrorism or extremism, or instigating any terrorist or extremist activity
  6. Inciting ethnic hatred or discrimination to undermine ethnic solidarity
  7. Detrimental to state religious policies, propagating heretical or superstitious ideas
  8. Spreading rumors to disturb economic and social order
  9. Disseminating obscenity, pornography, force, brutality and terror or crime-abetting
  10. Humiliating or defaming others or infringing upon their reputation, privacy and other legitimate rights and interests
  11. Other contents prohibited by laws and administrative regulations

Article 7:

A network information content producer shall take measures to prevent and resist the production, reproduction and publication of undesirable information containing the following:

  1. Using exaggerated titles that are seriously inconsistent with the contents
  2. Hyping gossips, scandals, bad deeds, and so forth
  3. Making improper comments on natural disasters, major accidents or other disasters
  4. Containing sexual innuendo, sexual provocations, and other information that easily leads to sexual fantasy
  5. Showing bloodiness, horror, cruelty, and other scenes that cause physical and mental discomfort
  6. Inciting discrimination among communities or regions
  7. Promoting indecency, vulgarity, and kitsch
  8. Contents that may induce minors to imitate unsafe behaviors, violate social morality, or induce minors to indulge in unhealthy habits
  9. Other contents that adversely affect network ecology
Figure 1: Types of prohibited online content listed in government legislation and regulations.

The government legislation and regulations listed in Figure 1 are not the only official sources detailing what kinds of online content are either legally prohibited or are politically undesirable. Another indicator of what online material is censored are official descriptions of internet clean-up campaigns. Since 2013, China’s cyber regulator the Cyberspace Administration of China, the Propaganda Department’s Office of the National Working Small Group for “Combating Pornography and Illegal Publications”, the Ministry of Public Security, and other party-state organs have conducted annual special operations for internet purification [净化网络环境专项行动, abbreviated as 净网]. These special operations involve identifying websites, platforms, and accounts which contain prohibited content, compelling the removal of content, and punishing those responsible through warnings or administrative or criminal penalties.

Internet purification operations initially concentrated on “obscene pornographic information” [淫秽色情信息]. But between 2013 and 2022, the focus of these special operations as stated in annual and semi-annual announcements widened to include a broader range of legally or politically prohibited content. An aggregate list of the targets of internet purification campaigns mentioned in these announcements is provided in Figure 2. Prohibited content mentioned in these announcements has recently included material which is “emotionally manipulative” [情感操控] (mentioned in 2020), promotes “historical nihilism” [历史虚无主义] (mentioned in 2021), or promotes “divination and superstition” [占卜迷信] (mentioned in 2022). An indication of the increasing breadth of these operations can be found in annual or semi-annual announcements made online by the Cyberspace Administration of China and the Ministry of Public Security about the progress of these operations. These announcements include information on the kinds of prohibited content which authorities have identified and removed. It is not clear if these annual announcements provide a full list of the material authorities targeted for removal during the year in question. Nonetheless, these announcements make clear that state-led internet purification campaigns routinely identify and remove not only pornographic online material, but many other kinds of prohibited content listed in relevant legislation.

  • Political rumors; historical nihilism; misusing the 100th anniversary of the founding of the Communist Party to engage in commercial activity; tampering with the history of the Party and the nation; slandering heroes and martyrs; opposing basic Constitutional principles; information which threatens national security
  • Violence; weapons; terrorism
  • Harmful material related to ethnic groups and religion; promotion of heterodox faiths, feudal superstitions, and online divination
  • Pornographic and vulgar content; socially harmful material; flaunting wealth and money worship; emotionally manipulative websites and platforms
  • Gambling
  • Fraud; illegal collection, editing, and publishing of financial information; publication of false information; blackmail; illegally buying and selling bank cards; sale of rare and endangered animals and plants
  • Illicit drugs
  • False advertising; false job recruitment posts; underhanded “black public relations”; paid internet posters
  • False pharmaceutical information; sale of counterfeit drugs; copyright infringement and counterfeiting
  • Illegal surrogacy; dishonest marriage websites
  • Unauthorized providing of online news services; fake news; misleading or false information on the epidemic situation in Beijing
  • Managing disorderly fan communities and online user accounts
Figure 2: Aggregate list of targets of special operations for internet purification mentioned in annual or semi-annual announcements, compiled from 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, and 2022.

Beyond the targets of internet purification campaigns noted in Figure 2, further insight into what kind of online content is censored can be found on the Cyberspace Administration of China’s Illegal and Undesirable Information Reporting Center [中央网信办(国家互联网信息办公室)违法和不良信息举报中心]. As part of the special operations for internet purification, the Cyberspace Administration of China encourages domestic internet users to make named or anonymous reports of prohibited “undesirable content” [不良信息内容] or “harmful information” [有害信息] through the Reporting Center. As part of the reporting process, users are asked to identify the kind of prohibited content they have found according to nine categories provided by the Reporting Center, which are listed in Figure 3: politics, violent terrorism, fraud and blackmail, pornography, vulgarity, gambling, rights infringement, rumors, and a broadly defined category of “other.” The nine categories listed in Figure 3 broadly match both the kinds of content proscribed under Chinese government legislation and regulations covering online content, as well as the targets of internet purification special operations listed in announcements made by the Cyberspace Administration and the Ministry of Public Security.

  • Politics
  • Violent terrorism
  • Fraud and blackmail
  • Pornography
  • Vulgarity
  • Gambling
  • Rights infringement
  • Rumors
  • Other, including: online borrowing and lending; online criminal activity; online commercial disputes; online false and illegal advertising; online extortion and post deletion; email and telephone harassment; intellectual copyright infringement; piracy; fake media and fake journalists; gang activity; online cultural market activity including music, performance, and animation; and telecommunications user services
Figure 3: Categories of prohibited material listed by the Cyberspace Administration of China’s Illegal and Undesirable Information Reporting Center.

Online announcements made by the Cyberspace Administration of China about the number of reports of prohibited content also suggest the kinds of content the Chinese state seeks to censor. These announcements arrange prohibited content into various categories, which only partially match those listed by the Reporting Center. Over the years these subcategories have included pornography, politics, vulgarity, gambling, rights infringement, rumors, terrorism, fraud, online extortion, paid post deletion, and other forms of content.

We performed searches on Google and Baidu during February and March 2023 for terms associated with these announcements: “全国网络举报受理情况” [national situation of the handling of online reports] and “全国网络举报类型分布” [national distribution of categories of online reports]. We collected websites which provided breakdowns of the categories of reports of prohibited content received by Cyberspace Administration offices across China (“全国各地网信办”) and by specific websites (“各网站”). We found some of these announcements on websites run by the national Cyberspace Administration of China or reporting websites run by provincial authorities, while others were published on news media websites.

These announcements do not appear to be consistently available, and we were only able to find ten announcements: nine monthly announcements released between January 2016 and January 2017, and one annual announcement for the year 2020, which are presented in Table 1. Nonetheless, the announcements which are available indicate the kinds of material that are reported and censored across platforms and websites in China. In addition, these announcements provide statistical information on the number of reports of prohibited content per category. We have provided a breakdown of this statistical data in Table 1. Based on the statistical data contained in these announcements, the majority of reported prohibited content is pornographic, followed by either political content or “other” material which does not fall into any of the other categories.

Category Jan 2016 May 2016 June 2016 Aug 2016 Sept 2016 Oct 2016 Nov 2016 Dec 2016 Jan 2017 2020
Pornography 64.7 60.4 60.4 60.7 63.7 60.3 60.7 50.0 55.2 61.7
Politics 8.9 12.9 11.3 11.9 11.8 13.1 13.9 29.0 23.5 7.7
Vulgarity n/a n/a n/a n/a n/a n/a n/a n/a n/a 3.3
Gambling 1.4 1.3 1.8 3.9 2.2 3.5 1.8 1.6 1.6 9.8
Rights Infringement 2.5 5.7 3.9 2.7 3.9 4.0 4.3 3.9 4.1 2.2
Rumors n/a n/a n/a n/a n/a n/a n/a n/a n/a 1.1
Terrorism 0.1 2.1 0.7 2.0 1.0 2.1 1.1 1.1 1.1 0.9
Fraud 4.5 8.2 11.9 8.4 7.1 7.1 8.0 4.9 5.0 1.3
Online Extortion and Paid Post Deletion 0.2 0.1 0.2 0.6 0.8 0.1 0.5 0.5 0.5 n/a
Other 17.7 9.3 9.8 9.8 9.5 9.8 9.7 9.0 9.0 12.0

Table 1: For announcements spanning 2016 to 2020, the percentage of reports in each announcement falling under each prohibited online content category. Archived copies of these announcements are linked through the respective date.

Other Chinese government announcements give an indication of which platforms are responsible for hosting prohibited content. These reports provide monthly or annual totals of the number of pieces of prohibited content reported to the authorities, broken down according to the website or platform on which the content was found. Alongside warning, fining, or in other ways punishing companies for hosting prohibited content, these public reports have the function of naming and shaming companies for failing to fully comply with Chinese laws on online content management.

Platform # of Reports
Weibo 53.126 million
Baidu 25.961 million
Alibaba 11.689 million
Kuaishou 6.59 million
Tengxun 6.309 million
Douban 3.514 million
Zhihu 2.143 million
Jinri Toutiao 2.063 million
Sina Wang 934,000
Sogou 331,000

Table 2: For different Internet platforms, the number of reports of prohibited content for 2021.

The most recent announcement we found concerning reports of prohibited content broken down by platform is for the year 2021. The statistical data contained in this announcement, presented in Table 2, indicates that Weibo was subject to the largest number of reports of prohibited content with 53.126 million reports, followed by Baidu (25.961 million) and Alibaba (11.689 million). The ten platforms listed in Table 2 account for roughly 110 million reports. According to the government announcement from which these data come, these 110 million reports are 75.6% of the 166 million reports of online prohibited content made in 2021.

Related work

There is a large body of previous research analyzing search platform censorship in China. Much of the earliest work focused on comparing censorship across web search engines accessible in China. In 2006, Reporters Without Borders tested six keywords by hand across multiple search engines accessible in China, finding that Yahoo returned the most pro-Beijing results among the first ten results compared to other search engines. In the same year, Human Rights Watch tested 25 keywords and 25 URLs by hand, finding that Baidu and Yahoo were the most censored. This earliest work analyzed keyword-based censorship by attempting to characterize and compare the top n results for a searched keyword. This type of analysis is limited by its subjectivity and by its inherent assumption that the search engine with the most politically sensitive results must be the least censored, when there may be explanations other than the application of censorship rules for a platform returning fewer sensitive results.

In 2008, in a follow-up to the previous studies, Citizen Lab researchers tested 60 hand-picked keywords across multiple search engines using an approach in which search queries were formed by combining a keyword with a web domain preceded by the “site:” operator to determine which domains were censored from the results of which keywords. Of all of the previous work we review, this work is the closest to ours. However, there are nevertheless fundamental differences. That work, like its predecessors, was limited to small sample sizes and relied on hand-picked samples. More importantly, its methods also cannot differentiate between a search query which is censored and one that genuinely has no results. While this may not seem significant in the context of testing hand-picked keywords, in our work we develop a method which can test whether a string of text triggers a censorship rule even when it would ordinarily return no results, and our method can isolate the exact keyword or keywords present in that string which are triggering its censorship. This capability was necessary for bridging the gap between testing lists of curated keywords and testing long strings of arbitrary text, which is a prerequisite for automated and ongoing testing of strings of text from sources such as news articles.

In a 2011 work, to instrument automated and ongoing censorship testing, Espinoza et al. developed a novel method using named entity extraction to select interesting keywords from a long string of news article text to use in search engine testing. Specifically, their method was designed to extract certain nouns, namely, the names of people, places, and organizations. The significance of this work is that it facilitates automatic censorship testing which does not rely on hand curation of keywords but instead can take as input long strings of arbitrary text, such as from news articles, automatically selecting from that text certain keywords to test. However, it is limited in that it makes assumptions about what type of content is likely to be sensitive, namely, certain kinds of nouns, and was used to test keywords for censorship individually. In contrast, we have found that censorship rules often require the presence of multiple, typically related keywords and commonly consist of a variety of parts of speech. Instead of selecting interesting keywords from a long string of text to test individually, our method tests a long string of text for the presence of censored content as a whole, even if it would not otherwise have any search results. We can then, using additional search queries, isolate the exact keyword or keywords triggering the censorship of that text. Our method is completely agnostic to and requires no assumptions concerning parts of speech, semantics, word boundaries, or other aspects of language and effortlessly generalizes to Chinese, English, Uyghur, and Tibetan languages, among others.

In another 2011 work, Zhu et al. use automated methods to test curated keywords consisting of the 44,102 most-searched keywords on Baidu and Google.cn, 133 keywords known to be censored by China’s national firewall, 1,126 political leaders of the Chinese government, and 85 keywords chosen by hand based on current events. This work is to our knowledge the first to speculate about the existence of different “white lists” of domains allowed to appear in the results for censored queries, whereas previous work has been framed in measuring which domains were blocked. In our work, we confirm the existence of these lists of authorized domains and attempt to quantify how many different lists exist, characterize when each list is applied, and measure which domains appear on each one.

More recently, in 2022 Citizen Lab researchers analyzed Microsoft Bing’s autosuggestion system for Chinese-motivated political censorship, finding that not only was it applied to users in mainland China but that it was also applied, partially, to users in North America and elsewhere. While this work is unlike ours in that it analyzed for Chinese censorship queries’ autosuggestions as opposed to the queries’ results proper, it is related to our work in that it studies the censorship of Bing, the only remaining major non-Chinese web search engine accessible in China.

Most recent work studying search platform censorship in China has analyzed the search censorship performed by social media platforms, namely that of Chinese microblogging platform Sina Weibo. For instance, in 2014, as part of the ongoing Blocked on Weibo project, Ng used automated methods to test for the censorship of 2,429 politically sensitive keywords previously curated by China Digital Times, finding that 693 were censored with explicit notifications. In a follow-up study months later, Ng found that most of these no longer had explicit censorship notifications but still returned zero results. Ng speculated that this may be because these keywords had been removed from search censorship while still being applied to post deletion censorship, or because Weibo had transitioned to a more covert form of censorship. Our findings in this report suggest that both hypotheses could be true: in Appendix A we demonstrate a method for evading Weibo search censorship which nevertheless often still yields zero results because search censorship rules are simultaneously applied to post deletion, and much of our analysis of Weibo focuses on a form of soft censorship which subtly restricts which results can appear for sensitive queries.

Work looking at Weibo censorship is often ad hoc and non-methodological, performed quickly in response to ongoing events, frequently to be featured in news articles or on social media. For example, in 2017, Citizen Lab researchers studied Weibo search censorship of human rights advocate Liu Xiaobo leading up to and in the wake of his passing. They found that censorship of his name and surrounding topics intensified immediately following his passing but eventually returned to baseline levels. In our work we provide a method to automatically and methodologically detect search censorship rules introduced in response to developing news events, with the intention of aiding such rapid investigations.

In response to a 2022 incident in which Canadian Prime Minister Justin Trudeau had a heated, public conversation with Chinese President Xi Jinping, journalist Wenhao Ma tweeted his discovery that “特鲁多” [Trudeau], “小土豆” [little potato] (a Chinese nickname for Trudeau), and the English word “potato” were censored by Weibo search. We highlight this example for two reasons. First, Ma identified these keywords as being censored even though they had search results, because their search results seemed to only contain results from official accounts with blue “V” insignia, recognizing that Weibo was applying a more subtle, softer form of censorship compared to simply displaying zero results. In our work, we develop a method to measure unambiguously when such keywords are subject to this type of censorship without attempting to glean it from the number of results from official accounts. Second, however, Ma’s claim that the English word “potato” was censored by Weibo was, while correct, misleading in that its censorship had nothing to do with Trudeau or potatoes but rather resulted from it containing the substring “pot”, a slang term for marijuana. To avoid this type of inadvertent misattribution, in our work, we use a carefully designed algorithm to extract the exact keywords or combination of keywords triggering the censorship of a string of text.

Model

In previous work studying automatic censorship of messages on WeChat, we determined that WeChat automatically censors messages if they contain any of a number of blocked keyword combinations, and we had defined a keyword combination as a set of one or more keywords such that, if each keyword in the combination was present somewhere in a text, the text would be censored. For instance, if WeChat censors the keyword combination {“Xi”, “Jinping”, “gif”}, then any message containing all of these keywords, anywhere in the message, in any order, is censored. Thus, this combination would censor “Xi Jinping gif” and “gif of Xi Jinping” but not “Xi gif”. In this model, keywords can overlap as well, so even “Xi Jinpingif” would be censored since it contains the strings “Xi”, “Jinping”, and “gif” somewhere in the message, although the latter two overlap.

To our knowledge, this manner of modeling WeChat’s automated chat censorship rules as a “list of unordered sets” of blocked keywords completely captured WeChat’s censorship behavior. However, censorship systems with different implementations may not be adequately captured by this model. For instance, a “list of ordered sequences” censorship system might require that the keywords appear in a specific order. For example, the rule (“Xi”, “Jinping”, “gif”) would censor “Xi Jinping gif” but not “gif of Xi Jinping”. A censorship system which implemented rules as a series of regular expressions would not only require the keywords to appear in an order but also that they not overlap. For example, the regular expression /Xi.*Jinping.*gif/ would censor “Xi Jinping gif” but neither “gif of Xi Jinping” nor “Xi Jinpingif”. Finally, a censorship system might use some machine learning algorithm to classify which queries to censor. However, we have not previously observed such systems used to perform real-time, political censorship, likely due to the requirements of such a system to operate with low false positives and to possess a nuanced, day-by-day understanding of what content is politically sensitive.

To attempt to capture the censorship behavior of as many search platforms as possible, in the remainder of this work we chose to use a “list of ordered sequences” model, as in doing so we are being as conservative in our assumptions as possible. For instance, by using ordered sequences, we can still model unordered rules, although this may require multiple ordered sequences to capture every possible permutation (e.g., (“Xi”, “Jinping”, “gif”), (“gif”, “Xi”, “Jinping”), etc.). In our model we also allow for the possibility that the keywords triggering censorship in a query may overlap, but by allowing for this possibility we can still measure systems in which keywords cannot overlap.

Throughout the remainder of this work, we use the term keyword combination to refer to such a “list of ordered sequences”, and we will express them as keywords separated by plus signs, e.g., “Xi + Jinping + gif”. Later in our work, we reflect on this model more. In our “Methodology” section, we explain exactly how we measure which sequence of keywords is triggering the censorship of a censored query, and in our “Results” section we reflect on how effective our model performed in capturing the actual censorship behavior of the search platforms which we measured.
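To make the distinction between these models concrete, the following minimal sketch shows one way the two matching semantics could be implemented. It is our illustration of the model rather than any platform’s actual implementation, and the treatment of overlaps reflects one reasonable reading of the model described above.

def matches_unordered(text, keywords):
    # "List of unordered sets": every keyword must appear somewhere in the
    # text, in any order, and keywords may overlap.
    return all(k in text for k in keywords)

def matches_ordered(text, keywords):
    # "List of ordered sequences": keywords must appear in the given order,
    # but successive keywords are allowed to overlap (each search resumes one
    # character after the start of the previous match, not after its end).
    pos = 0
    for k in keywords:
        i = text.find(k, pos)
        if i == -1:
            return False
        pos = i + 1
    return True

rule = ("Xi", "Jinping", "gif")
print(matches_unordered("gif of Xi Jinping", rule))  # True
print(matches_ordered("gif of Xi Jinping", rule))    # False: wrong order
print(matches_ordered("Xi Jinpingif", rule))         # True: overlapping keywords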

Methodology

In this section we describe our overall experimental methodology and then detail the methodologies of three different experiments which we perform.

Search platforms analyzed

We aimed to analyze the most popular platforms across different kinds of Internet platforms, ranging from web search engines and e-commerce platforms to social media sites. Overall, we selected eight different search platforms to analyze, including three web search engines, four social media platforms of varying types, and one e-commerce platform (see Table 3 for the full list).

Website Description Object of Censorship
Baidu Web search engine Web page results
Baidu Zhidao Q&A platform Q&A post results
Bilibili Video sharing platform Video results
Bing Web search engine Web page results
Douyin Operated by TikTok’s ByteDance, the version of TikTok accessible from mainland China Video results
Jingdong E-commerce platform Product recommendations
Sogou Web search engine Web page results
Weibo Microblogging site similar to Twitter Microblog results

Table 3: The search platforms that we analyzed and the object of censorship which we measured on them.

Notably, our selection includes one platform not operated by a Chinese company — Microsoft Bing — whereas the remaining are all operated by Chinese companies.

Measuring whether a query is censored

When a user using a search platform searches for a query, this query is sent to a server. If the query is not censored, the server will respond with the corresponding matches to the query. However, with a censored query, there are two possibilities depending on the search platform:

  1. The server returns a unique notification when the user’s query contains sensitive content. We call this transparent censorship because the signal is unambiguous.
  2. The server spuriously omits some or all search results despite that content matching the user’s query. We call this opaque censorship due to there existing an ambiguity as to whether the query was censored or whether those matches never existed.

For search platforms which employ transparent censorship, measuring whether a query is censored is straightforward: test the query and check if there is a notification that the query is censored. However, for search platforms which censor opaquely, we were required to employ a more sophisticated methodology to distinguish between cases where there are genuinely zero matches and cases of opaque censorship. In the following section we discuss the method we used to distinguish between these cases.

Measuring whether a string of text is opaquely censored

On platforms which employ opaque censorship, in order to distinguish between cases where there are genuinely zero matches and cases where matches exist but are being opaquely censored, we use a technique of creating test queries for a string of text such that they should always return matches unless the string of text is censored, tailored to each platform. We call such a modified query a truism, which Wikipedia defines as “a claim that is so obvious or self-evident as to be hardly worth mentioning, except as a reminder or as a rhetorical or literary device”. Our truisms are search queries which should obviously return results but are used as devices to unambiguously detect the presence of censorship of a string of text.

As an example, on Baidu Zhidao, we create a truism by surrounding the string in question with “the -(” before the string and “)” after the string. Thus, for Baidu Zhidao, to test the string “习近平” we would test the truism “the -(习近平)”. On Baidu Zhidao this syntax indicates to the search platform to logically negate whatever is in between the parentheses and can be interpreted as searching for all results containing “the” which do not contain “习近平”. In the case of Baidu Zhidao and many other platforms, we have discovered that even content which is negated in a query can still trigger the query’s censorship.

Website Transformation Explanation
Baidu “site:com.cn␣-(” + string + “)” Since there is no character or string which is present on every web page, we negate the possibly-censored string, and, since Baidu queries cannot begin with a negation, we use the “site:” operator to restrict results to a popular top-level domain.
Baidu Zhidao “the␣-(” + string + “)” Since posts exist containing “the”, this query, which requests posts containing “the” but not the possibly-censored string, should always return results unless the string is censored.
Bilibili string Transparent censorship (although Bilibili does not report a censorship notification, a unique error code is returned by Bilibili’s API although this is invisible during ordinary usage)
Bing “microsoft␣|␣” + string Since web pages exist containing “microsoft”, this query requesting pages containing either “microsoft” or the possibly-censored string should always return results unless the string is censored.
Jingdong string Transparent censorship (although Jingdong does not report a censorship notification, the platform only fails to provide any recommendations when a query is sensitive)
Sogou “site:com.cn#齉” + string Although Sogou does not support disjunction (X | Y) or negation (-X), we discovered that Sogou supports a “site:” operator restricting results to the proceeding URL’s domain. While only the domain of the URL is used to restrict results, we found that other elements of a URL, such as a path or fragment may nevertheless be included although they have no effect on the results of the query. Therefore, by searching for “site:com.cn#XXX”, we are merely searching for any page with a top-level “com.cn” domain, and the fragment proceeding the “#” only exists to trigger censorship. Finally, since Sogou’s censorship system strips punctuation including hashes from queries before testing them for sensitive content, we place a “齉”, an extremely rare Chinese character unlikely to appear in censored content, between the “#” and the possibly-censored string to cause the censorship system to not join the possibly-censored string to “com.cn”. The “齉” character was also chosen because, unlike punctuation which causes Sogou’s URL parser to stop parsing the text as a URL, a “齉” is allowed by Sogou’s parser to be in a URL fragment.
Weibo string Transparent censorship (Weibo displays a censorship notification)

Table 4: Rules for transforming a query for string to a truism testing for censorship of string such that string is censored if and only if its corresponding truism returns zero results.

As another example, while Baidu Zhidao and many other platforms seemed to naively scan queries for the presence of strings to trigger censorship, Bing’s censorship system seemed clever enough to not allow negated content to trigger censorship. However, we were still able to create a truism on Bing by searching for “(microsoft | 习近平)”. On Bing this syntax indicates to the search platform to return results that contain either “microsoft” or “习近平”. Since we know that there exist pages on the Internet containing “microsoft” and since “microsoft” is not censored, then, if there are no results, it must be because “习近平” is censored. See Table 4 for the rules which we used to create truisms to test each site employing opaque censorship.

Note that truisms are not necessarily tautological, i.e., they are not guaranteed a priori to return results, even though we could theoretically construct such tautological queries. For instance, we could construct a query “(习近平 | -习近平)” which would request any result that either contains 习近平 or does not contain 习近平 (i.e., every result). However, in our testing, search platforms did not seem designed to recognize such queries as tautologies, and often the results would be logically inconsistent (e.g., “习近平” reporting more results than “(习近平 | -习近平)”). As such, by “truism” we refer to queries which, when not censored, are merely certain to return results in practice although not necessarily a priori.

Since whitespace and punctuation characters can induce unpredictable behavior in censorship systems and because they can potentially interfere with the syntax added by our truisms, we strip all whitespace and punctuation from strings before testing. While it is possible that by performing this practice we may be failing to discover some censorship rules which require punctuation, we found in our previous study of WeChat that it strips whitespace and punctuation from messages before testing them for censorship and that failing to strip these characters ourselves resulted in their spurious inclusion in our results. Therefore, out of caution, because we prefer the accuracy of our results over the possibility of a slightly larger set of results, we strip whitespace and punctuation from our test strings before testing.
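As an illustration of how these pieces fit together, the sketch below normalizes a test string and assembles the hard-censorship truisms from Table 4 for the platforms that censor opaquely. It is a simplified sketch of our approach rather than our actual testing harness, and the normalization shown here only strips ASCII punctuation.

import string

def normalize(s):
    # Strip whitespace and (ASCII) punctuation before testing, as described above.
    return ''.join(c for c in s if not c.isspace() and c not in string.punctuation)

# Query transformations from Table 4 for platforms that censor opaquely.
HARD_TRUISMS = {
    'baidu':        lambda s: 'site:com.cn -(' + s + ')',
    'baidu_zhidao': lambda s: 'the -(' + s + ')',
    'bing':         lambda s: 'microsoft | ' + s,
    'sogou':        lambda s: 'site:com.cn#齉' + s,
}

def build_truism(platform, text):
    # A censored string yields a truism returning zero results; an uncensored
    # string yields a truism that should always return results.
    return HARD_TRUISMS[platform](normalize(text))

print(build_truism('baidu_zhidao', '习近平'))  # the -(习近平)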

By using the method described in this section of testing using truisms, we can be certain that if our query does not have any returned matches then it must be due to the result of censorship. Thus far, we have discussed how we measure hard censorship, that is, censorship which denies the user from any matches. However, in this following section, we discuss how to measure a more subtle form of censorship in which the matches may be partially censored.

Measuring whether a query is partially censored

Across the three web search engines we tested, many queries which did return results only returned results linking to websites which were Chinese state-owned or state-approved media outlets (see Figure 4 for an illustration).

Figure 4: On Baidu, an example of a query whose results are only from Chinese state media.

Moreover, for some social media search platforms, we noticed that, for some queries that did return results, these results seemed to be only from accounts which have received a certain amount of verification or approval. We call this type of censorship in which results are only allowed from authorized sources soft censorship and censorship in which no results are allowed hard censorship (see Table 5 for a breakdown of each platform we discovered performing soft censorship).

Website Soft censorship
Baidu Only shows results from authorized domains (typically Chinese government sites, Chinese state media, etc.)
Bing Only shows results from authorized domains (typically Chinese government sites, Chinese state media, etc.)
Douyin Only verified accounts
Sogou Only shows results from authorized domains (Chinese government sites, Chinese state media, etc.)
Weibo Only verified accounts

Table 5: Platforms which we discovered performing soft censorship and the manner in which they perform it.

To detect this form of soft censorship, for each web search engine, we modified its truism by restricting the results to only be allowed from unauthorized sources. For example, on Baidu, we only allow results from microsoft.com, a site we chose because it is both popular and accessible in China but foreign operated and unlikely to be pre-approved for voicing state propaganda. For Baidu, we surrounded the tested string with “site:microsoft.com -(” on the left and “)” on the right in order to transform it into a truism and test it for soft censorship but with the restriction that results were only allowed from an unauthorized source. Thus, for the string “彭帅”, we would test the truism “site:microsoft.com -(彭帅)”, which can be interpreted as searching for any page on microsoft.com not containing “彭帅”. See Table 6 for the rules which we used to create truisms to test each site employing soft censorship.

Website Transformation Explanation
Baidu “site:microsoft.com␣-(” + string + “)” Same as in Table 4 except restricted to microsoft.com, an unauthorized site.
Bing “site:microsoft.com␣(microsoft␣|␣” + string + “)” Same as in Table 4 except restricted to microsoft.com, an unauthorized site.
Douyin string + “␣‰‰” Douyin normally displays results for any query, no matter if there exist no results which are an exact match or even a close match. However, when queries are soft censored, Douyin applies two restrictions to results, namely that results must contain all words in the search query and that results must be from verified accounts. As such, we additionally search for “‰‰”, which has not been posted by any verified account, so that soft-censored queries will never display search results.
Sogou “site:microsoft.com#” + string Same as in Table 4 except restricted to microsoft.com, an unauthorized site.
Weibo “‰‰␣-(” + string + “)” Only non-verified accounts have posted the string “‰‰” and there is no character or string which is in all posts containing “‰‰”.

Table 6: Rules for transforming a query for string to a truism testing for soft censorship of string such that string is soft censored if and only if its truism returns zero results.

Notably, although many sensitive queries on Douyin returned zero results, we did not find any evidence of hard censorship on Douyin that could not be explained by the soft censorship system which we explain in Table 6. As such, on Douyin, we only measure its soft censorship system.
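Continuing the sketch above, the soft-censorship truisms from Table 6 can be assembled in the same way; again, this is only a simplified illustration under the same assumptions, not our full testing code.

# Query transformations from Table 6 for detecting soft censorship.
# normalize() is the helper defined in the earlier sketch.
SOFT_TRUISMS = {
    'baidu':  lambda s: 'site:microsoft.com -(' + s + ')',
    'bing':   lambda s: 'site:microsoft.com (microsoft | ' + s + ')',
    'douyin': lambda s: s + ' ‰‰',
    'sogou':  lambda s: 'site:microsoft.com#' + s,
    'weibo':  lambda s: '‰‰ -(' + s + ')',
}

def build_soft_truism(platform, text):
    # Zero results for this truism indicate that the string is soft censored,
    # i.e., results for it are restricted to authorized sources or accounts.
    return SOFT_TRUISMS[platform](normalize(text))

print(build_soft_truism('baidu', '彭帅'))  # site:microsoft.com -(彭帅)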

Isolating which keywords are triggering censorship

Thus far we have discussed how to determine whether a string of text is either hard or soft censored across each of the search platforms which we tested. However, given that a string of text is censored, we still desire to know which keyword or combinations of keywords present in that text are responsible for triggering its censorship. In this section, we outline the method we employ of isolating which combination of keywords is triggering a text’s censorship by making additional queries.

To isolate keywords triggering censorship of a string of text, we make use of an algorithm called CABS which we originally introduced in 2019 and continue to maintain here. Our original algorithm was motivated by discovering censored keyword combinations on WeChat, which we modeled as a “list of unordered sets”, but, as we model censorship in this work as a “list of ordered sequences”, we adapted the algorithm to fit this model. Fortunately the changes required were trivial, essentially replacing all sets and set operations with tuples and their corresponding tuple operations (e.g., set union is replaced with tuple concatenation). To fully understand the algorithm and to access its code, we recommend visiting the previous links in this paragraph. However, in the remainder of this section we will briefly outline the intuition behind how this isolation algorithm works.

The algorithm works by performing bisection, attempting to truncate as much of the text being isolated as possible at each step while preserving the property that it is still censored. For example, initially it will attempt to remove the second half of the text and measure if it is still censored. If it is, then it will attempt to remove the last three quarters of the text. If it is not, then it will attempt to remove only the last quarter of the text. By iteratively repeating this procedure, the algorithm eventually discovers the final character of one of the keywords triggering the censorship of the string. It next attempts to discover the first character of this keyword. Once the complete keyword is discovered, the algorithm tests if the keyword or keywords it has discovered thus far are sufficient to trigger censorship. If not, it repeats this process of finding another keyword in the censoring keyword combination on the remaining text up to but not including the final character of the keyword most recently discovered. It repeats this process until enough keywords have been discovered to trigger censorship, producing the censored keyword combination in its entirety.

There are, as it turns out, many subtleties to doing this correctly and efficiently, especially when keywords can overlap or when multiple keyword combinations triggering censorship may be present. However, through careful design and testing, our algorithm is correct even in the presence of such corner cases.

Website Character Explanation
Baidu “‰” Efficiently encoded in GB18030, the encoding under which our testing query length on Baidu is limited.
Baidu Zhidao “‰” Efficiently encoded in GBK, the encoding under which our testing query length on Baidu Zhidao is limited.
Bilibili “” Sufficient to separate keywords.
Bing “” In our testing, we found that different join characters could produce different results, suggesting that Bing may use complicated rules for how keywords in a single rule may be separated. In our report, we use “” as our “join” character, although another choice may have split keyword combinations into a greater number of keywords.
Douyin “” Sufficient to separate keywords.
Jingdong “␣” Sufficient to separate keywords.
Sogou “齉” Efficiently encoded in GB18030, the encoding under which our testing query length on Sogou is limited, and, unlike punctuation or other characters such as “‰”, can be parsed by Sogou as part of a URL fragment (see Table 4). We chose this specific Chinese character because it is extremely rare (refers to a stuffy, nasal voice) and thus unlikely to collide with actual censorship rules but occurs in the basic multilingual plane.
Weibo “‰” Efficiently encoded in GB18030, the encoding under which our testing query length on Weibo is limited.

Table 7: For each tested search platform the “join” character that is used when isolating the combination of keywords triggering the censorship of a string of text.

The one way in which the algorithm must be adapted to a given search platform is by choosing a “join” character. This selection is necessary as not every platform considers the same characters as splitting a keyword. For instance, on one platform, putting spaces in between the characters of a censored keyword may not prevent it from being censored, but on another it may. A desirable “join” character for a platform is one whose insertion into a censored keyword would prevent it from triggering censorship, but also one that is unlikely to appear in censored keywords that we wish to measure and that can be encoded efficiently in whatever character encoding a search platform internally uses. For a breakdown of the “join” characters that we used for each tested search platform with corresponding motivations, please see Table 7.

By using this algorithmic technique, we can determine the exact keyword or combination of keywords necessary to trigger censorship of a censored string of text. The results are not statistical inferences, approximations, or in any way probabilistic. Moreover, the algorithm is agnostic to and makes no assumptions concerning language, including parts of speech, semantics, word boundaries, or other aspects of language, and it effortlessly generalizes to Chinese, English, Uyghur, and Tibetan languages, among others.
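To convey the intuition behind the bisection, the sketch below isolates a triggering keyword combination using an is_censored() oracle (for example, one built from the truisms above) and a platform join character from Table 7. It is a deliberately simplified sketch of the idea that assumes the input text is censored; it does not reproduce the corner-case handling of the full CABS algorithm referenced above.

def isolate_simple(text, is_censored, join='‰'):
    # Simplified sketch: repeatedly find, by bisection, the last and first
    # characters of another keyword needed to trigger censorship, until the
    # keywords discovered so far are sufficient on their own. Assumes `text`
    # itself is censored and that censorship follows the ordered-sequence model.
    keywords = []

    def censored_with(fragment):
        # Test a candidate fragment together with the keywords already found,
        # separated by the platform's join character.
        return is_censored(join.join([fragment] + keywords))

    remaining = text
    while not (keywords and is_censored(join.join(keywords))):
        # Shortest censored prefix: its last character ends the next keyword.
        lo, hi = 1, len(remaining)
        while lo < hi:
            mid = (lo + hi) // 2
            if censored_with(remaining[:mid]):
                hi = mid
            else:
                lo = mid + 1
        end = lo
        # Latest starting position that still censors: the keyword's first character.
        lo, hi = 0, end - 1
        while lo < hi:
            mid = (lo + hi + 1) // 2
            if censored_with(remaining[mid:end]):
                lo = mid
            else:
                hi = mid - 1
        keywords.insert(0, remaining[lo:end])
        # Search for further keywords in the text up to, but not including,
        # the final character of the keyword just found.
        remaining = remaining[:end - 1]
    return keywords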

Overcoming testing hazards

During the preliminary design and testing of our methods, we observed that our methodology would need to overcome multiple hazards in order to provide thorough and accurate results, including captchas, various restrictions on query length, and inconsistent results returned by search platforms. Below we outline how we overcome these hazards.

Captchas

We found that, after a period of automated testing, Sogou and Baidu began displaying captchas instead of displaying search results until the captchas were solved. We did not investigate attempting to solve the captchas automatically. However, for Sogou, we found that whenever presented with a captcha we could restart the browser session to resume testing for some substantial period of time until the next captcha appeared, at which point we could simply restart the browser session again. For Baidu, restarting the browser session was typically ineffective. However, we found that solving a captcha would allow a browser session to test for 24 hours uninterrupted by captchas. Thus, to instrument Baidu’s testing, every 24 hours we manually solve a captcha for each browser session, requiring only seconds of manual intervention each day. However, if Baidu’s captcha displays became more frequent or if we wanted to completely automate the testing, future work might look at applying software designed to automatically solve these captchas.

Douyin also displayed captchas. However, unlike with Sogou and Baidu, even after solving Douyin’s captchas, repeated search querying would inevitably begin yielding zero results for any search query, regardless of its sensitivity. As such, we were not able to complete every experiment with Douyin, as we stopped testing it early in our analysis due to this limitation.

Query length limitations

The search platforms we tested have limitations in the length of query which we could test. Exceeding these limits had various consequences, such as the platform returning an error message, silently truncating the query, or all content beyond the limit evading censorship. As such, for each platform, we performed testing to determine the value of any applicable limit. As characters can take varying space in different representations or encodings, we also had to determine the unit of the limit, which we found to vary across platforms as being a function of the number of raw Unicode characters or the number of bytes in some character encoding such as UTF-8, UTF-16, GB18030, etc. (see Table 8 for a complete breakdown).

Website Maximum query length Encoding and unit
Baidu 76 GB18030 bytes
Baidu Zhidao 76 GBK bytes
Bilibili 33 Unicode characters
Bing 150 UTF-8 bytes
Douyin 202 UTF-16 bytes
Jingdong 80 Unicode characters
Sogou 80 GB18030 bytes
Weibo 40 GB18030 bytes

Table 8: For each search platform, the maximum query length we used in testing.

Our code was written so that queries exceeding a platform’s limit were never tested, ensuring the reliability of our results.
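
As a minimal sketch of this safeguard, the snippet below computes a query's "length" in the unit each platform uses (Table 8) so that a query can be checked against the limit before testing. The limits and units come from Table 8; the structure and names are ours.

    # Maximum query lengths and the unit in which each is measured (Table 8).
    LIMITS = {
        "baidu":        (76,  lambda q: len(q.encode("gb18030"))),
        "baidu_zhidao": (76,  lambda q: len(q.encode("gbk"))),
        "bilibili":     (33,  len),                                # Unicode characters
        "bing":         (150, lambda q: len(q.encode("utf-8"))),
        "douyin":       (202, lambda q: len(q.encode("utf-16-le"))),  # UTF-16 bytes
        "jingdong":     (80,  len),                                # Unicode characters
        "sogou":        (80,  lambda q: len(q.encode("gb18030"))),
        "weibo":        (40,  lambda q: len(q.encode("gb18030"))),
    }

    def fits(platform: str, query: str) -> bool:
        """Return True if the query is within the platform's length limit."""
        limit, measure = LIMITS[platform]
        return measure(query) <= limit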

Inconsistent search results

We observed inconsistencies in the search results of some search engines during our testing. When we searched for a truism for the first time, we found that some platforms would occasionally return no results even if the truism was not censored. Testing it again would yield results. We hypothesize that the eccentric queries which we construct would sometimes overwhelm the search platform but that, once it had sufficient time to be primed, it would then return results for subsequent searches using that query. For other platforms, we also observed cases in which a small number of censorship rules appeared to be applied inconsistently. We hypothesize that such inconsistent observations may have resulted from load balancing between servers with small differences between the censorship blocklists with which they had been deployed. In any case, to make our measurements robust to these and other inconsistencies, we apply the following algorithm, expressed below in Python-like pseudocode, which effectively retests a potential keyword combination an additional two times over the span of three hours before considering it a censored keyword combination:

    def robust_isolate(censored_text):
        combo = isolate(censored_text)  # returns the list of keywords triggering censorship
        for round in range(2):  # retest an additional two times
            wait_for_an_hour()
            last_combo = combo
            censored_text = ''.join(combo)
            combo = isolate(censored_text)  # returns the list of keywords triggering censorship
            if combo != last_combo:
                # force a measure of progress to ensure termination of the algorithm
                if len(combo) < len(last_combo) or (len(combo) == len(last_combo) and len(''.join(combo)) < len(''.join(last_combo))):
                    return robust_isolate(''.join(combo))  # restart with the new keyword combination
                return None  # give up
        return combo

In this code, in the event that we discover an inconsistent result, we do one of two things depending on how the new result compares to the previous one. If, compared to the previous result, the new result’s keyword combination has either fewer constituent keywords or if it both has the same number of constituent keywords but the sum of the lengths of each of its constituent keywords is less than those of the previous result, then we restart the robust isolation process from scratch on the new keyword. Otherwise, we simply give up attempting to isolate the triggering keyword combination from the given censored string. We have this rule in place to ensure that the isolation of the keyword combination is making some measure of progress, in either having fewer keywords or in having the same number of keywords but shorter ones. This policy ensures that, in an environment where servers may be giving inconsistent results, the algorithm still terminates, either by eventually returning a reliable result or by failing. Although we did not collect data specifically pertaining to this matter, we believe from casual observation that such failures are exceedingly rare and occur only when nothing else could have been easily done to obtain a reliable result.

Experiments

In this work we perform three experiments using different sampling methodologies to address different research questions that we had. In our first two experiments, we test search platforms for the censorship of people’s names and of known sensitive content, respectively. We also present a third, ongoing experiment from which we already have preliminary results, in which we test text for censorship from daily news sources. In the remainder of this section we set out the design of these three experiments in greater detail.

Experiment 1: Measuring censorship of people’s names

In our first experiment we test people’s names. Individuals or their names have the following desirable properties:

  • Individuals can represent highly sensitive or controversial issues.
  • Unlike more abstract concepts, a comprehensive sample of notable people and their names can be automatically curated and enumerated into a large test list.
  • As opposed to a list of handpicked keywords or a list of sensitive keywords censored in other Chinese products, a list of people’s names is not biased toward the sort or style of keywords censored in other Chinese products or toward a researcher’s preconceptions.

To facilitate this experiment, we used a list of the names of 18,863 notable people which we had previously curated from Wikipedia in 2022. The manner in which we curated these names is spelled out in a previous report, but, at a high level, they were collected from Wikipedia by looking for people whose articles had a sufficiently high number of Wikipedia views and whose names had a sufficiently high amount of search volume on Microsoft Bing. While this list of notable names inevitably contained the names of famous Chinese politicians, political dissidents, and others whom we might expect to be the targets of censorship, the criteria through which we selected these names were designed to be unbiased and to also produce names whose censorship we might not expect, with the intention of finding surprising results.

In this experiment we test each person’s name individually. For each name on this list, to generate a test string, we take the person’s name as expressed in the Wikipedia article title and append, if different, the name in simplified Chinese characters and append, if different, the name in traditional Chinese characters, forming a final test string of between one and three variations of the name concatenated together. If in testing we find that the test string is censored, we then use our isolation algorithm to isolate a keyword combination triggering its censorship.
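
A minimal sketch of how such a test string might be assembled appears below. The function name and signature are ours; the simplified- and traditional-character renderings are assumed to come from the curated Wikipedia data.

    def build_name_test_string(title: str, simplified: str, traditional: str) -> str:
        """Concatenate the Wikipedia article title with the simplified- and
        traditional-character renderings of the name, skipping duplicates,
        to form a single test string of one to three name variants."""
        variants = []
        for name in (title, simplified, traditional):
            if name and name not in variants:
                variants.append(name)
        return "".join(variants)

For example, build_name_test_string("Ai Weiwei", "艾未未", "艾未未") would yield "Ai Weiwei艾未未", a test string of two variants.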

While isolating the triggering keyword combination may not seem necessary when individually testing keywords such as people’s names, as it might seem apparent that sensitivity of that person’s name must be responsible for triggering the censorship, we found it helpful in discovering cases where names were collaterally censored, either by accidentally containing another part of another censored name (e.g., “习” [Xi]), or by accidentally containing other sensitive characters triggering its censorship (e.g., “伍富橋” [Alvin Ng] censored due to containing the character “橋” [bridge] following the Sitong Bridge Protests).

During this experiment, for each platform tested, we record each censored name and the keyword combination triggering its censorship.

Experiment 2: Measuring censorship of known sensitive content

In our second experiment we test from a compilation of known sensitive content. Previous work has shown that, to comply with Chinese censorship regulations, companies are generally responsible for curating their own censorship lists and that lists used by any two companies will, on average, have little overlap. However, due to the onerous task of compiling these lists, which may contain tens of thousands of keywords or more, companies are often reluctant to invest the resources required to develop their own lists, instead opting to use whatever lists that might be most easily available. Software developers have been known to take censorship lists with them when leaving companies and to later use them in new products. Furthermore, when comparing a list across a database consisting of as many as thousands of other previously discovered Chinese censorship lists, it can be possible to find one or more lists from which the list in question may have been derived (or lists which may have been derived from the list in question) due to an amount of overlap unexplainable by chance. Therefore, testing from a large sample of other products’ lists can be an effective way to find what another product is censoring.

As such, in our second experiment, we tested samples drawn from a database of thousands of Chinese censorship lists previously discovered on other platforms, consisting in aggregate of 505,903 unique keywords. Instead of testing keywords individually, we treated the entire database as a large text by concatenating the unique keywords together, ordered first by frequency and secondarily lexicographically. By treating the database as a single, large text, we were able to test more content at once, limited only by each search platform’s restrictions on query length, decreasing the time required to test and increasing the chance of discovering keyword combinations consisting of more than one keyword. When we discovered a censored string of text, we isolated its triggering keyword combination and recorded it. We then resumed testing at the character after the censored keyword combination’s earliest character (i.e., after the keyword combination’s earliest keyword’s earliest character).
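
The following sketch illustrates this scanning loop under stated simplifications: is_censored and robust_isolate stand in for the measurement primitives described earlier, the window length is assumed to already respect the platform's query limit, and keywords straddling a window boundary are ignored.

    def scan_text_for_rules(platform: str, text: str, max_len: int):
        """Scan a concatenated keyword database in windows, isolating and
        recording each triggering keyword combination, then resuming after
        the combination's earliest character."""
        rules, pos = [], 0
        while pos < len(text):
            window = text[pos:pos + max_len]
            if not is_censored(platform, window):
                pos += max_len  # simplification: ignores keywords straddling windows
                continue
            combo = robust_isolate(window)
            if combo is not None:
                rules.append(combo)
                # resume after the earliest keyword's earliest character
                pos += window.index(combo[0]) + 1
            else:
                pos += 1
        return rules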

Experiment 3: Ongoing testing from news articles

In our third experiment, we test text from news articles in a perpetual, ongoing fashion. Our motivation for choosing news articles is that they are easy to collect, contain words related to current events, and often cover politically sensitive topics. Furthermore, they may themselves be the desired object of inquiry on a web search platform or the object of discussion on a social media network, as we found many titles of, or long phrases from, news articles censored on search platforms.

To facilitate this experiment, every 60 seconds, we check for and collect news articles from 16 different RSS feeds spanning Mandarin, Cantonese, Tibetan, and Uyghur languages as well as editorial stances which range from expressly pro-Beijing to expressly critical of Beijing including those news sources with stances in between (see Table 9 for a complete list).

Table 9: RSS news sources used in testing.

For the purposes of testing, we consider each article’s text to be the concatenation of its RSS title, description, and URL. On each search platform, we then test as much of each article at a time as possible, as limited by the platform’s maximum query length. As in Experiment 2, when we discover a censored string of text, we isolate its triggering keyword combination and record it, and we then resume testing at the character after the censored keyword combination’s earliest character.
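
A minimal sketch of the collection step, assuming the third-party feedparser library, is shown below. The feed URL is a placeholder (the 16 feeds are listed in Table 9), and the generator simply yields each new article's test text.

    import time
    import feedparser  # third-party RSS parsing library

    FEEDS = ["https://example.org/feed.rss"]  # placeholder; see Table 9 for the real feeds

    def poll_feeds(seen: set):
        """Every 60 seconds, collect new articles and build each article's
        test text as the concatenation of its RSS title, description, and URL."""
        while True:
            for url in FEEDS:
                for entry in feedparser.parse(url).entries:
                    link = entry.get("link", "")
                    if not link or link in seen:
                        continue
                    seen.add(link)
                    yield entry.get("title", "") + entry.get("summary", "") + link
            time.sleep(60)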

Experimental setup

We coded an implementation of our experiments in Python using the Selenium Web browser automation framework and executed the code on Ubuntu Linux machines. We tested each search platform from a Toronto network except for Bing, which we tested from a Chinese vantage point using a popular VPN service. Experiment 1 was performed in October 2022. Experiment 2 was performed in February 2023. Experiment 3 began January 1, 2023, and is ongoing as of the time of this writing.

Results

In this section we detail the results of our first two experiments and present preliminary results from our third.

Experiment 1: Censorship of people’s names

Among the 18,863 names tested from Wikipedia, we found a combined 1,054 unique names — over 1 in 18 — censored across the search platforms which we tested. Among the unique censored names, 605 were hard censored on at least one platform, and 449 were only ever observed to be soft censored. Platforms which performed both hard and soft censorship performed very little hard censorship, suggesting that operators prefer soft censorship when they possess the capability. From the censors’ perspective, soft censorship may be more desirable because the way in which it controls information is less obvious, but it may also be desirable from the platform operators’ perspective because it creates less friction during a user’s interaction with the platform: a user who receives no results on one platform may be tempted to switch to another.

Among the platforms analyzed, we found Weibo to target the highest number of names (474) for some type of censorship, and among the web search engines we found similar levels of censorship, with Sogou targeting 282 names, Baidu 219, and Bing 189. Strictly concerning hard censorship, web search engines targeted very few names. Baidu hard censored “习明泽” [Xi Mingze], Xi Jinping’s daughter, and “徐晓冬” [Xu Xiaodong], a mixed martial artist with anti-China political views. Seemingly beyond coincidence, Sogou hard censored the same two names, although Sogou targeted Xi Mingze with a broader rule: “习 + 明泽” [Xi + Mingze]. These similar findings are especially surprising as Xu Xiaodong, while a sensitive name, would not seem as sensitive or as well known a name as many others in Chinese politics. We did not find Bing to hard censor any names.

Figure 5: For each platform, for hard and (if applicable) soft censorship, a breakdown by category of the number of names censored in that category.

To better understand the motivations behind search platforms’ censorship of people’s names, we developed a codebook to categorize each censored keyword according to a person’s significance, particularly in the context of Chinese political censorship. Following grounded theory, we first went through all censored names to discern broad categories and repeated themes. This iteration led to seven high-level themes for the codebook. We then reviewed all of the censored names again and applied an appropriate label to each keyword (see Figure 5).

We categorized sensitive names into seven common themes: “Political” (e.g., political leaders and dissidents, major historical events, criticism of the Communist Party, or proscribed political ideas), “Religious” (e.g., banned religious faiths, spiritual leaders, and religious organizations), “Eroticism” (e.g., pornographic material, adult film actors, sex acts, adult websites, and paid sexual services), “Collateral” (names collaterally censored by a censorship rule targeting someone or something else), “Business” (businesspeople who do not have a clear political motivation for their censorship), “Entertainment” (celebrities, artists, singers and related figures in the entertainment and associated industries who do not have a clear political motivation for their censorship), and “Other” (a residual category containing content which either does not fit within the other six categories or has been censored for unclear reasons).

We found that most names censored by platforms were censored for political motivations, whether to shield leaders and other pro-Chinese-Communist-Party members from criticism or to silence dissidents. However, we also found that many names were collaterally censored by rules clearly targeting content other than that person or their name. As examples in English, Hong Kong musical artist “DoughBoy” was soft censored on Weibo for containing “ghB”, GHB being a drug illegal in China and broadly elsewhere, and Baidu Zhidao censored South Korean band “FLAVOR” for containing “AV”, an abbreviation for adult video. As examples in Chinese, Polish violinist “亨里克维尼亚夫” [Henryk Wieniawski] was soft censored on Weibo for containing “维尼” [Winnie (the Pooh)], a common mocking reference to Xi Jinping, and Microsoft Bing soft censored Chinese actress “习雪” [Xi Xue] for containing “习” [Xi], Xi Jinping’s surname. Weibo’s soft censorship collaterally affected the largest number of names due to the platform’s use of broad censorship rules, followed by Jingdong and Baidu Zhidao in second and third place, respectively. These examples of collateral censorship speak to the methodological importance of not just testing whether a string is censored but also of understanding the exact censorship rule triggering its censorship, to avoid misattributing the censor’s motives.

Comparing the soft censorship of social networks Douyin and Weibo, we found that they censor a similar number of names for political motivations, with Douyin censoring slightly more. However, due to its use of broader rules, Weibo had many more names censored collaterally, whereas Douyin’s more specific rules were able to pinpoint political names without collaterally affecting any others in our measurements.

Experiment 2: Censorship of known sensitive content

In our testing spanning 505,903 unique, previously discovered censored keywords, we found 60,774 unique keyword combinations censored across all search platforms which we investigated. Due to Douyin aggressively fingerprinting and banning our testing, we were unable to complete this experiment for Douyin. We also omit Bing hard censorship results from our discussion in this section, as we only discovered four keyword combinations hard censored by Bing, and we believe that these keyword combinations were measurement artifacts of attempting to measure keyword combination censorship against a machine learning classifier trained to detect pornographic queries (we discuss this further in the “Evaluation of our Model” section below).

Figure 6: For each platform, for hard and (if applicable) soft censorship, a breakdown by category of the estimated number of keyword combinations discovered in that category.

To understand the type of content censored on each platform, we randomly sampled 200 keyword combinations censored on each platform and categorized them, as in the previous experiment, into themes which resemble but are not identical to those used there: “Political” (e.g., political leaders and dissidents, major historical events, criticism of the Communist Party, or proscribed political ideas), “Religious” (e.g., banned religious faiths, spiritual leaders, and religious organizations), “Eroticism” (e.g., pornographic material, adult film actors, sex acts, adult websites, and paid sexual services), “Illicit Goods” (e.g., narcotics, weapons, and chemicals), “Other Crime” (e.g., gambling, fraud, extortion, counterfeiting, and private surveillance), and “Other” (a residual category containing content which either does not fit within the other five categories or has been censored for unclear reasons). For each platform, based on the proportion of keyword combinations that we found in each category in our random sample, we then estimated the total number of keyword combinations in each category.
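
A minimal sketch of the scaling arithmetic is shown below; the labeling step itself was manual, so the `label` callable here is a stand-in for the codebook coding, and the function names are ours.

    import random
    from collections import Counter

    def estimate_category_totals(combos, label, sample_size=200):
        """Estimate the number of censored keyword combinations per category
        by labeling a random sample and scaling each category's share up to
        the platform's total."""
        sample = random.sample(combos, min(sample_size, len(combos)))
        counts = Counter(label(c) for c in sample)
        return {cat: len(combos) * n / len(sample) for cat, n in counts.items()}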

We based our criteria for these six categories on what we found through examining censored content on the eight platforms listed in Table 3. Our six categories also roughly match the categories of prohibited content listed in Chinese government legislation (Figure 1), the targets of internet purification special operations (Figure 2), official announcements on reports of undesirable or harmful online information (Figure 2), and the nine categories of illegal, undesirable, or harmful information listed on the Cyberspace Administration of China’s Reporting Center (Figure 3). Below we describe our findings from each of these categories in more detail.

Political

We found that a large proportion of censored names of political leaders refer to Xi Jinping’s name “习近平” or his family. Examples include his current wife “彭丽媛” [Peng Liyuan], his former wife “柯玲玲” [Ke Lingling], his sister “齐桥桥” [Qi Qiaoqiao], and his daughter “习明泽” [Xi Mingze] (see Table 10 for a breakdown per platform).

Baidu (hard) Baidu (soft) Baidu Zhidao Bilibili Bing (soft) Jingdong Sogou (hard) Sogou (soft) Weibo (hard) Weibo (soft)
1,332 390 510 282 64 53 91 1,754 0 33

Table 10: For each platform, a breakdown of the estimated number of keyword combinations discovered related to Xi Jinping or his family based on sample testing.

Censored terms referring to Xi Jinping included the hard censoring of numerous homoglyphs (e.g., “刁斤干” on Baidu) of Xi’s name, as well as hard censoring of terms like “xi + 包子” [xi + bun] on Bilibili, a reference to earlier propaganda campaigns which painted Xi as an avuncular figure with simple tastes. Other references to Xi are soft censored, including some homonyms (e.g., “吸精瓶”) and the term “三连任” [three consecutive terms] on Bing, a reference to Xi’s third term as China’s paramount leader. References to Xi’s personal life are also widely censored. These include terms like “离婚 + 习近” [divorce + Xi Jin], possibly referring to Xi’s first marriage to Ke Lingling. References to Xi Jinping’s daughter Xi Mingze, such as “明泽 + 公主” [Mingze + princess] on Sogou, are hard censored. Little information is publicly available about Xi Mingze, but she is believed to have enrolled in Harvard University in 2010 under a pseudonym. Terms related to other Xi family members are also censored, including references to rumors that Xi Jinping’s elder sister Qi Qiaoqiao has Canadian citizenship (e.g., “加拿大籍 + 习近平 + 大姐” [Canadian nationality + Xi Jinping + eldest sister]) which are hard censored on Baidu.

While references to Xi Jinping are the most widely censored among all of China’s political leaders, references to other past and current political figures are also censored. Some references to former premier “温家宝” [Wen Jiabao], including homonyms of his name (e.g., “温加煲” on Bilibili), are soft censored, as are phrases like “温 + 贪污” [Wen + corruption] on Weibo, the latter referring to accounts of alleged Wen family corruption covered in an investigative report by The New York Times.

Terms indicating criticism of the Communist Party were also subjected to censorship. These include homonyms for Communist Party (e.g., “共抢党”, soft censored on Weibo), as well as calls for the Communist Party to step down (e.g., “GCD + 下台” [GCD + step down] on Baidu). Some hard-censored slogans, like “洗脑班” [brainwashing class] on Jingdong and “共产党灭亡” [death of the Communist Party] on Bilibili, are associated with material produced by the Falun Gong spiritual movement. For example, the term “退党保平安” [quit the Party to stay safe and peaceful], hard censored on Bilibili, refers to a campaign launched by the Falun Gong to encourage Chinese citizens to quit the Communist Party, Communist Youth League, and the Young Pioneers. Other censored terms refer to the 1989 Tiananmen Square protests and subsequent massacre (e.g., “TAM学生” [TAM students], soft censored on Bing) and notable dissidents (e.g., “晓波刘” [Xiaobo Liu], soft censored on Bing, and “吾尔开希” [Wu’er Kaixi], hard censored on Bilibili).

Religious

Much of the censored content concerning religion refers to banned spiritual groups, in particular the Falun Gong. These include homonyms for Falun Gong (e.g., “法仑功”) and references to the persecution of Falun Gong devotees (e.g., “弟子 + 迫害 + 洗脑” [disciple + persecution + brainwash]), both soft censored on Baidu. References to other banned spiritual groups are also soft censored, such as “觀音法門” [Guanyin Famen, in traditional characters] on Sogou, “观音法门” [Guanyin Famen, in simplified characters] on Bing, and “狄玉明” [Di Yuming], the spiritual leader of Bodhi Gong.

Not all censored religious terms refer to banned spiritual groups. The title of Tibet’s exiled spiritual leader, “达賴喇嘛” [Dalai Lama], is hard censored on Jingdong. Terms related to Christianity are also hard censored, though for reasons which are not immediately clear. “耶稣 + 少儿” [Jesus + children], “青少年 + 上帝” [youths + God], and “青少年 + 基督教夏令营” [youths + Christian summer camp] are all hard censored on Jingdong. While authorities have not banned Catholicism and Protestantism, Christian religious activities are strictly monitored throughout China. Authorities also routinely surveil, harass, and detain practitioners of underground house churches and Christian-influenced banned faiths like Church of Almighty God (“全能神教会”). The hard censoring of references to youths and Christianity may also be in response to reported state efforts to prevent those under the age of 18 from participating in religious education.

Eroticism

Censored terms in this category refer to various kinds of pornographic material, acts or body parts, and paid sexual services. These include terms like “色情无码” [uncensored pornography], soft censored on Bing, Japanese adult film actor “唯川纯” [Jun Yuikawa], hard censored on Baidu Zhidao, and sex acts like “舔嫩逼” [lick tender pussy], hard censored on Bilibili. Other censored terms refer to soliciting sex workers, such as “包夜 + 按摩” [overnight + massage], soft censored on Baidu, or “婊子上门” [visiting prostitutes], hard censored on Baidu Zhidao, or specific body parts, like “大屌” [big dick], soft censored on Sogou.

Illicit Goods

Many censored terms concerning illicit goods refer to drugs. Some refer to selling drugs, like “卖 + 咖啡因” [sell + caffeine] or “售 + 摇头丸” [sale + ecstasy], both hard censored on Bilibili, or “售 + 地西泮” [sale + diazepam], soft censored on Baidu. Other terms concern manufacturing drugs, such as “制作 + 毒药” [crafting + poison], soft censored on Sogou, or “提炼 + 三甲氧基安非他明” [refining + trimethoxyamphetamine], hard censored on Bilibili.

Censored terms also refer to weapons, including euphemistic references to particular weapons (e.g., “气狗” [air dog] or air gun, hard censored on Jingdong), their sale (e.g., “批发 + 弓弩” [wholesale + bow and crossbow], hard censored on Weibo), or their manufacturing (e.g., “制作 + 枪” [crafting + gun], hard censored on Weibo).

Chemicals also feature as censored terms, such as “光气 + 提供” [carbonyl chloride + supply], soft censored on Baidu, and references to buying particular kinds of insecticide (e.g., “敌杀磷 + 购买” [Dioxathion + buy], soft censored on Sogou). It is unclear why references to particular chemicals have been censored, though in some cases censorship may be related to the potential use of some chemicals in the manufacturing of narcotics or the production of explosives.

Gambling terms also make up a large number of censored terms related to illicit goods. These gambling-related terms include the names of particular websites (e.g., “金沙sands线上娱乐场” [Golden Sand sands online resort], a reference to Sands Casino in Macao, hard censored on Sogou), particular kinds of gambling (e.g., “赌马 + 开户” [horse betting + open account]), and even “online casinos” in general (“网上赌场”, soft censored on Weibo).

Other Crime

This category of prohibited content contains references to a range of illicit or criminal activity. Some refer to various forms of fraud or forgery, including selling high-quality counterfeit identity cards (“卖高仿身份证”, soft censored on Bing) or searching for police uniforms (“警服”, hard censored on Jingdong).

Other censored terms range from references to adopting babies (“收养 + 宝宝” [adoption + baby], soft censored on Sogou) to selling organs (“售 + 肝臟” [sell + liver], soft censored on Weibo), potentially censored due to police efforts to deal with child trafficking and kidnapping and with illegal organ harvesting, respectively. References to other illicit activities like live broadcasting suicide (“自杀 + 直播” [suicide + live], soft censored on Sogou), hiring kidnapping services (“替人绑架” [kidnap for someone], hard censored on Weibo), or selling diplomas (“卖 + 文凭” [sell + diploma], soft censored on Baidu) are also censored, as are terms related to buying commercial surveillance devices (e.g., “供应 + 卧底窃听软件” [sale + undercover eavesdropping software], soft censored on Baidu).

Other

This residual category included censored terms which did not clearly fit within the other five categories. Some of these were profanity, both in Chinese (e.g., “艹你妈” [fuck your mom], hard censored on Jingdong) and English (e.g., “Fucker”, soft censored on Weibo). Others referred to news websites blocked in China, like Radio Free Asia (“自由亚洲电台”, hard censored on Bilibili) and the Taiwanese newspaper Liberty Times (“自由时报”, soft censored on Sogou).

References to censorship itself and how to circumvent content management controls were also censored. These include references to the Great Firewall (“防火长城”, soft censored on Sogou), “翻牆” [leaping over the wall], soft censored on Weibo, and “网络发言防和谐” [online speech anti-censorship], soft censored on Baidu. In other cases, censorship reflected the corporate concerns of a specific platform. Jingdong hard censored the term “狗东” [Dog East or “gou dong”], a satirical play on words referring both to the company’s name Jingdong and the use of a cartoon dog as the company’s mascot.

Impact of censorship across platforms

Although measuring the number of censorship rules targeting a type of content may be a valuable measure of the amount of attention or resources that a platform has invested into censoring that content, it may be a misleading measure of the actual impact of that censorship. For instance, when looking at Baidu’s soft censorship rules, we found 559 keyword combinations containing the character “习” [Xi]. Many of these are homonyms of Xi Jinping’s name (e.g., “习进瓶”) or derogatory references (e.g., “习baozi”). Although Baidu uses a large number of rules containing “习”, Bing has only one such soft censorship rule containing “习”, but that rule simply censors all queries containing the character “习” without any additional specificity. From this, a naive analysis might conclude that Baidu’s censorship of Xi is 559 times broader than Bing’s since it has 559 times as many rules, yet Bing’s single, broad rule censors more Xi-related queries than Baidu’s long list of specific rules.

To attempt to measure which search platforms had the broadest censorship, we devised a new metric. At first, as an attempt to approximate the number of queries a keyword combination is censoring, we considered using search engine trends data, but such data had two major issues: first, trends data appeared to exist only for the most common queries, and, second, trends data for a query counted only exact matches rather than all queries containing the string as a substring. For example, trends data for “习近” [Xi Jin] would show fewer results than for “习近平” [Xi Jinping], despite “习近” being a substring of “习近平”. Thus, to approximate how many queries were censored by a rule censoring all queries containing “习近”, we would have to anticipate all such queries that one might make which contain “习近” and then add up the trends data for each.

Instead, we adopted a different metric, which we call the impact score, which was devised to approximate the number of web pages censored by a keyword combination. To determine the impact score for a keyword combination rule, we created a query where each keyword in that keyword combination was surrounded by quotation marks. For instance, for the keyword combination “习 + 二婚”, we recorded the number of results for the following search query:

"习" "二婚"

This query requests all pages containing both the exact phrase “习” and the exact phrase “二婚”, which mirrors the corresponding keyword combination censorship rule which censors any query containing those exact phrases. For this testing, to obtain the number of web pages impacted by a keyword combination, we measured using Bing as accessed from the Canadian region, as, to the best of our knowledge, Bing has not implemented Chinese political censorship of search results in this region. Note that we apply this metric even to search platforms which are not web search engines, even though these platforms are not searching web pages but rather other items such as store products, microblog posts, or social media videos, as we suspect that this metric can still approximate the impact of the censorship on these platforms as well.
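
A minimal sketch of the query construction is shown below; the function name is ours, and the result count returned by Bing for the constructed query serves as the rule's impact score.

    def impact_query(keyword_combination: list[str]) -> str:
        """Build the quoted query used to approximate a rule's impact,
        e.g. ["习", "二婚"] -> '"习" "二婚"'."""
        return " ".join(f'"{kw}"' for kw in keyword_combination)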

Figure 7: For each platform, for hard and (if applicable) soft censorship, a breakdown by category of the estimated sum of the impact scores of each keyword combination in that category.

To understand the type of content most likely to be censored on each platform, we randomly sampled 200 keyword combinations from each platform by performing a weighted uniform sampling, with replacement, weighted by the impact score of each keyword combination. We then categorized these keyword combinations using the same codebook as before. These results are characteristically different than before, with Weibo now demonstrating the highest level of total censorship among the search platforms we analyzed (see Figure 7).
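
A minimal sketch of this sampling step, assuming the impact scores have already been computed, might use Python's built-in weighted sampling with replacement:

    import random

    def sample_by_impact(combos, impact_scores, k=200):
        """Weighted uniform sampling with replacement, weighted by each
        keyword combination's impact score."""
        return random.choices(combos, weights=impact_scores, k=k)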

Analyzing by category, we find that Jingdong has the highest level of censorship of illicit goods. This finding is not unexpected given that Jingdong is the only e-commerce platform that we analyzed and thus could be expected to have broader filtering of illegal goods.

Figure 8: Among web search engines, a breakdown by category of the estimated sum of the impact scores of each soft-censored keyword combination in that category.

Turning our focus to the three web search engines, we find that Sogou has the highest level of overall censorship. Bing has slightly less overall censorship than Baidu. However, breaking down by category, Bing’s level of censorship of political and religious topics exceeds Baidu’s, with Baidu’s filtering of content related to eroticism, illicit goods, and other crimes exceeding Bing’s. This finding suggests that Bing is not a suitable alternative to Baidu for users attempting to freely access political or religious content and that to access such content Baidu may be a better choice despite it being operated by a Chinese company.

Experiment 3: Ongoing testing from news articles

In this section we briefly discuss preliminary results from our ongoing experiment measuring censorship rules by testing news articles.

Baidu Baidu Zhidao Bilibili Bing Jingdong Sogou Weibo
1,493 1,426 165 115 329 4,438 908

Table 11: For each platform, as of April 2, the number of new censored keyword combinations which we discovered outside of the previous two experiments.

As of April 2, 2023, our testing, which began January 1, 2023, has discovered between 115 and 4,438 new keyword combinations on each platform analyzed. Unfortunately, we have limited ability to know whether a newly discovered keyword combination was recently added or merely recently discovered. However, while many of the newly discovered censorship rules could not be shown to be recently added, many others referenced events that occurred since January 1, 2023, seemingly requiring them to have been introduced since then.

As examples, Weibo soft-censored “中国间谍气球” [Chinese spy balloon], referring to a Chinese balloon shot down on February 4, 2023, over the United States which the United States and Canada accused of being used for surveillance, as well as “阮晓寰” [Ruan Xiaohuan], an online dissident who was recently convicted of inciting subversion to state power. Baidu hard censored “逮捕令 + 普京 + 习近平” [Arrest Warrant + Putin + Xi Jinping], referring to an arrest warrant issued on March 17, 2023, by the International Criminal Court for Vladimir Putin. In the days following the issuance, Xi Jinping would visit Putin in Russia. Sogou’s soft censorship of the Ukraine crisis used a large number of very specific keyword combinations, many of them referencing 2023 developments:

  • 乌克兰 + 王吉贤 [Ukraine + Jixian Wang]
  • 博明驳斥美国 + 台湾乌克兰化谬论 [Bo Ming refutes the US + fallacy of Ukrainization of Taiwan]
  • 入侵乌克兰一年后 + 俄罗斯依赖中国 [A Year After Invading Ukraine + Russia Depends on China]
  • 成为下一个乌克兰 + 台湾 [Be the next Ukraine + Taiwan]
  • 王吉贤 + 乌克兰 [Jixian Wang + Ukraine]
  • 抗议 + 俄罗斯 + 乌克兰战争 [Protests + Russia + Ukraine War]
  • 大疆 + 无人机 + 乌克兰 [DJI + Drones + Ukraine]
  • 马斯克 + 乌克兰 + 星链 [Musk + Ukraine + Starlink]
  • 俄罗斯 + 入侵 + 乌克兰 + 一年 [Russia + invasion + Ukraine + year]

As we have previously mentioned, our isolation algorithm generalizes effortlessly to all languages. For example, we found that many platforms censored keyword combinations containing Uyghur script. Below are two examples of Bing censoring Uyghur-script terms referring to issues of Xinjiang independence:

  • ئەركىنلىك [Freedom]
  • ۋەتىنىمىز [Our homeland]

This experiment has only recently begun, and we intend to continue performing this ongoing experiment, measuring how censorship unfolds across these platforms in realtime in response to world events.

Evaluation of our Model

We now reflect on how well our modeling of search platforms’ censorship rules as a “list of ordered sequences” fits with their censorship behavior in practice. In general, we found our results to be highly internally consistent under our model. However, both Jingdong and Bing showed inconsistencies which should not be possible under our model. For instance, on Jingdong, we found that both the strings “枪” and “射网枪” are censored even though the string “网枪”, which contains “枪”, is not. On Bing, we found that both the strings “89天安门” and “1989天安门” are soft censored, even though the string “989天安门”, which contains “89天安门”, is not. On these platforms, characters surrounding censored keywords can seemingly play a role in determining whether to censor a string, behavior which is currently not captured by our model, and thus the number of censored keyword combinations may be underreported on these platforms.

While these were minor departures from our model, a more extreme case would be all four keyword combinations which our algorithm found to be hard censored on Bing:

  • 台湾 + 小穴 + 护士做爱 + 台湾 [Taiwan + pussy + nurse sex + Taiwan]
  • 片BT下载BANNED骚逼 [Piece BitTorrent Download BANNED Pussy]
  • 你 + 你 + 你 + 你的屄 [you + you + you + your cunt]
  • 你 + 你 + 你的屄 + 你 [you + you + your cunt + you]

Unlike our results for other platforms, including those of Bing’s soft censorship, the results for Bing’s hard censorship are, while seemingly related to eroticism, mostly nonsensical to human interpretation. The results pages for each of these four queries showed an explicit notification that results were blocked due to a mandatory “safe search” filter being applied to the mainland Chinese region, and we suspect that we were triggering a machine learning classification system trained to detect search queries related to eroticism. While machine learning algorithms struggle to censor according to subtle, broad, and rapidly evolving political criteria, they are more effective at detecting relatively narrower, more well-defined, and more slowly changing criteria such as whether a query is related to pornography. As such, these results may be an interesting glimpse into what would happen if we applied our isolation algorithm against a censorship system applying a machine learning classifier intending to politically censor content.

Authorized domain lists

All three web search engines which we analyzed performed soft censorship, a censorship scheme in which if a query contained a soft-censored combination of keywords, then results would only be returned from a list of authorized domains. In this section, we explore whether different search engines authorized different domains and whether different domains are authorized for different keyword combinations.

To investigate these questions, we first developed a method to measure whether a domain was authorized to be displayed in results for a given string. Our method is a simple modification of the one which we used to determine whether a string is soft censored in general: we replace “site:microsoft.com”, a domain which we presumed would not be authorized for any soft-censored string, with “site:IsThisAuthorized.com”, where IsThisAuthorized.com is the domain which we wish to test for authorization for that soft-censored string. Using this method, we tested across a set of domains D and a set of strings S.
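
The sketch below illustrates how the resulting matrix of authorizations might be built; is_authorized stands in for the probe just described and is not shown here.

    def build_authorization_matrix(platform: str, strings, domains):
        """For every (string, domain) pair, test whether the domain is
        authorized for that soft-censored string on the given platform."""
        return {
            s: {d: is_authorized(platform, s, d) for d in domains}
            for s in strings
        }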

To choose S, we selected those name test strings which we found soft censored on all three web search engines in Experiment 1. To determine D, for each of those strings, on each platform, we then searched for these strings, recording all domains which we observed in the first 100 search results. D then is the set of all domains which we observed during this procedure. In our experiment, S consisted of 83 strings, and D consisted of 326 domains.

In October 2022, on each web search engine, for each domain in D and string in S, we tested whether that domain was authorized for that string on that search engine. We then collected the results into a two dimensional matrix. To draw out the general shape of the lists, we hierarchically clustered both dimensions of the matrix according to the UPGMA method.
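
A minimal sketch of this clustering step, assuming the matrix is represented as a binary NumPy array of strings by domains, uses SciPy's average-linkage (UPGMA) hierarchical clustering:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, leaves_list

    def cluster_matrix(matrix: np.ndarray) -> np.ndarray:
        """Hierarchically cluster both dimensions of a binary authorization
        matrix (strings x domains) with UPGMA (average linkage), returning
        the matrix reordered to draw out the 'shape' of the lists."""
        row_order = leaves_list(linkage(matrix, method="average"))
        col_order = leaves_list(linkage(matrix.T, method="average"))
        return matrix[row_order][:, col_order]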

Figure 9: The “shape” of the authorized domains lists for Baidu (left), Bing (center), and Sogou (right): for each domain (x axis) and string (y axis), whether the domain is authorized for that string (light yellow) or not (dark red).

We found disparate authorization lists across web search engines (see Figure 9). We found that, in our experiment, Sogou authorized, on average, the fewest domains for each string, followed by Bing, with Baidu authorizing the most. Bing used the same authorization list for each string which we tested, whereas Baidu appeared to use approximately two different lists, although some strings used lists with small additions or subtractions from these two. Sogou appeared to mostly use two lists, with a third and fourth list being applied to some tested strings. In comparing Baidu and Bing, Baidu had a more complicated set of authorizations, whereas Bing broadly applied the same list to each string and thus authorized fewer domains overall. While one might hypothesize that more sensitive keyword combinations are associated with shorter lists of authorized domains, we surprisingly did not notice any correlation between sensitivity and authorized domain list length.

To better understand the domains authorized by these search engines, we categorized them into three categories (see Table 12). The majority of sites on each list were Chinese state-approved news sites. Examples include xinhua.org (Xinhua News Agency), people.cn (People’s Daily), and qq.com (QQ News). Sites from this category professed varying degrees of loyalty to the Chinese Communist Party (CCP), ranging from presenting the necessary regulatory license to practice journalism to statements such as these from huyangnet.cn (Authoritative information release platform of Xinjiang production and Construction Corps) indicating Party sponsorship (translated): “Bingtuan Huyang.com is a key news portal website of Bingtuan, which is approved by the Information Office of the State Council and sponsored by the Propaganda Department of the Party Committee of Xinjiang Production and Construction Corps.”

List News Party-state Other Total
D 241 (73.9%) 77 (22.8%) 8 (2.45%) 326
Baidu (short) 45 (64.3%) 24 (34.3%) 1 (1.43%) 70
Baidu (long) 221 (73.4%) 76 (25.2%) 4 (1.33%) 301
Bing 82 (89.1%) 8 (8.70%) 2 (2.17%) 92
Sogou (shortest) 11 (84.6%) 2 (15.4%) 0 (0.00%) 13
Sogou (shorter) 22 (91.7%) 2 (8.33%) 0 (0.00%) 24
Sogou (longer) 50 (84.7%) 8 (13.6%) 1 (1.69%) 59
Sogou (longest) 51 (76.1%) 13 (19.4%) 3 (4.48%) 67

Table 12: Breakdown of authorized domain lists by category.

Other sites were more directly operated by either the Chinese Communist Party or the Chinese state. Many of these were official government websites of different jurisdictions, such as xinjiang.gov.cn, the official web page of the People’s Government of Xinjiang Uyghur Autonomous Region, and gqt.org.cn, the official website of the Chinese Communist Youth League.

Finally, a small residual category contains miscellaneous sites such as search engines and sites providing health information.

One behavior which we were interested in understanding was how search engines behaved when two different soft-censored strings with different authorization lists occurred in the same query. Depending on how the systems are implemented, search engines may prefer the first observed keyword combination (such as if all censorship rules were implemented using a single deterministic finite-state automaton) or they might take the set intersection of all of the authorization lists for each occurring keyword combination. We found, however, that, when testing two different censored strings with different authorization lists, the list of one is preferred over the other regardless of their positions with respect to each other in the query (see Tables 13 and 14). This finding is consistent with a system which iterates over a list of blocked keyword combinations, testing for their presence in a query, and where, as soon as one is found present, the corresponding action for that keyword combination is taken, aborting the rest of the search.
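
The sketch below is a simple model consistent with this observed behavior, not the platforms' actual implementation: rules are scanned in order and the first matching rule's action is taken.

    def evaluate_query(query: str, rules):
        """Iterate over an ordered list of (keyword_combination,
        authorized_domains) rules and act on the first rule whose every
        keyword appears in the query, ignoring any later matches."""
        for combination, authorized_domains in rules:
            if all(kw in query for kw in combination):
                return authorized_domains  # first match wins; the scan is aborted
        return None  # no censorship rule applies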

baidu.com mofcom.gov.cn
李克强李剋強 Authorized Unauthorized
司徒華司徒华 Unauthorized Authorized
李克强李剋強司徒華司徒华 Authorized Unauthorized
司徒華司徒华李克强李剋強 Authorized Unauthorized

Table 13: On Baidu, when both “李克强李剋強” [Li Keqiang, in both simplified characters and traditional characters] and “司徒華司徒华” [Situ Hua, in both traditional and simplified characters] are present, the authorized domains list for “李克强李剋強” is used.

tibet.cn baidu.com
韩正韓正 Authorized Unauthorized
馬凱碩马凯硕 Unauthorized Authorized
韩正韓正馬凱碩马凯硕 Authorized Unauthorized
馬凱碩马凯硕韩正韓正 Authorized Unauthorized

Table 14: On Sogou, when both “韩正韓正” [Han Zheng, in both simplified and traditional characters] and “馬凱碩马凯硕” [Ma Kaishuo, in both traditional characters and simplified characters] are present, the authorized domains list for “韩正韓正” is used.

While this may seem like a mundane finding, it suggests that the original order of keyword combinations discovered on the list can be reconstructed, at least partially, as the order of two keyword combinations with the same authorization lists cannot be directly compared using this method. Such an information side channel could be useful in measuring when a keyword combination was added to the list, where otherwise we would only know when it was first discovered on the list. Furthermore, not just knowing which keyword combinations are censored but also their order on a blocklist can be helpful for inferring how such lists of censorship rules are shared among companies, developers, and other actors, as two lists might have many censorship rules in common by coincidence but, as the number of common censorship rules grows, it becomes super-exponentially unlikely that both lists would have those censorship rules in the same order purely by coincidence.
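
As a rough illustration of how such pairwise preferences could be turned into a (partial) list order, the sketch below topologically sorts the observations; the rule identifiers are assumed to be hashable (e.g., tuples of keywords), and this is our own illustration rather than a method used in the report.

    from graphlib import TopologicalSorter  # Python 3.9+

    def reconstruct_order(preferences):
        """Given pairwise observations (a, b) meaning rule a's authorization
        list was preferred over rule b's when both co-occurred in a query,
        recover one list ordering consistent with all observations."""
        graph = {}
        for preferred, other in preferences:
            graph.setdefault(preferred, set())
            graph.setdefault(other, set()).add(preferred)  # `preferred` precedes `other`
        return list(TopologicalSorter(graph).static_order())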

An example geoblocking block page for a popular Chinese news site.
Figure 10: An example geoblocking block page for a popular Chinese news site.

Finally, when we were categorizing the domains, we noticed that a surprisingly large number of sites were inaccessible from outside of China. We found that, among the 338 domains in D, when accessed from a Toronto network, 59 (17.5%) failed to return an HTTP 200 status for either that domain or for that domain preceded by “www.”. Some sites appeared to block connections on an IP or TCP layer, whereas others presented application layer block pages (see Figure 10). While the motivation for Chinese censors blocking non-Chinese sites from Chinese access is well understood, it is less understood why Chinese sites are in turn blocking access to non-Chinese users. Future research is required to understand this troubling progression of the balkanization of the Internet.

Limitations

Our study analyzes automated censorship of search queries across a variety of platforms. However, there exist other layers of censorship which might also be affecting search results. For instance, on social media platforms, posts may be automatically or manually deleted or shadow-banned if they contain sensitive content. In fact, the rules for automatically censoring posts may often match the rules for censoring search queries (see Appendix A). Users may also self-censor under the fear of reprisal for posting sensitive content. However, our work analyzes the rules used by platforms to automatically censor search results but not any of the other factors which might be skewing those results.

Some search platforms may have some censorship rules which censor not according to whether a query contains certain keywords but whether the query exactly equals a certain keyword. While this may seem like an inflexible manner to censor queries, we have observed such a case on Weibo, specifically when searching by hashtag, when we observed one hashtag which was censored (e.g., #hashtag) but superstrings of that hashtag which were not (e.g., #hashtagXYZ). Our method will often fail to detect censorship rules such as these which require exact matches, as our isolation algorithm requires that any query containing the censored content be censored in order to isolate the content triggering censorship.

Discussion

As North American technology companies such as Google mull over whether to expand search or other services to the Chinese market, a popular argument has been that, although infringing on users’ political and religious rights is inherently wrong, perhaps a North American company could better resist Chinese censorship demands and provide a less-infringing service than a Chinese company. However, even if the ends are to justify the means, for this argument to have any validity the service provided by the North American company must actually be less infringing.

Unfortunately, our study provides a dismal data point concerning this argument. It suggests that whatever longstanding human rights issues persist in China, they will not be magically addressed by North American technology companies pursuing business in the Chinese market. To the contrary, our report shows that users of Microsoft Bing are subject to broader levels of political and religious censorship than if they had used the service of Bing’s chief Chinese competitor. In fact, rather than North American companies having a positive influence on the Chinese market, the Chinese market may be having a negative influence on these companies, as previous work has shown how the Chinese censorship systems designed by Microsoft and Apple have affected users outside of China.

The methods introduced in our work facilitate future, ongoing censorship measurement. In our third experiment, we presented preliminary results from ongoing measurements that discover search platform censorship by sampling text from news articles. We intend to continue running this experiment for the indefinite future, tracking changes to search platform censorship over time as events around the world unfold.

The challenges in moderating search queries are similar to those of moderating queries to machine-learning-powered chat bots such as ChatGPT: just as with search platforms, the censorship system may have an understanding of a query that is inconsistent with that of the actual query evaluator, and this inconsistency can be exploited to measure for the presence of censorship. As one possible example, AI researcher Gary Marcus found through experimentation that ChatGPT responded to the query, “What religion will the first Jewish president of the United States be?” as follows: “It is not possible to predict the religion of the first Jewish president of the United States. The U.S. Constitution prohibits religious tests for public office… It’s important to respect the diversity of religions and beliefs in the United States and to ensure that all individuals are treated equally and without discrimination.” The ChatGPT query he used to reveal its censorship is tautological in a way reminiscent of our use of truisms to reveal search engine censorship. In the same way that tautological queries which should have guaranteed answers can be used to measure chat-bot censorship, our work uses truisms, which should have guaranteed search results, to measure search platform censorship rules. We hypothesize that, as Baidu or Microsoft introduce their chat bots into the Chinese market, a similar use of such tautological queries or truisms can be used to flesh out what political censorship rules these bots implement to comply with Chinese political censorship laws and regulations.

Finally, we hypothesize that the methods we used to test for search engine censorship can be adapted to evade censorship as well. In this report, to measure search platforms’ censorship rules, we utilized the inconsistency between the censorship filter’s and the query parser’s understanding of specially crafted queries. In Appendix A, we present a proof of concept demonstrating how this inconsistency can be exploited to evade censorship on Weibo. We leave it to future researchers to exploit this inconsistency to develop additional evasion techniques to evade Weibo search censorship as well as to evade search censorship on other search platforms at large.

Acknowledgments

We would like to thank an anonymous researcher for contributing to this report. We would also like to thank Jedidiah Crandall, Jakub Dalek, Katja Drinhausen, Pellaeon Lin, Adam Senft, and Mari Zhou for valuable editing and peer review. Research for this project was supervised by Ron Deibert.

Availability

For Experiment 1, the list of names we discovered to be hard or soft censored on each platform as well as the keyword combination triggering their censorship is available here. For Experiment 2, the list of keyword combination rules which we discovered to be triggering hard or soft censorship on each platform tested is available here. For our authorized domains experiment, for each platform, a matrix detailing whether each tested string was authorized or unauthorized for each tested domain is available here. The algorithms which we use to isolate which combination of keywords is triggering censorship of a string are available here.

Appendix A: Evasion of Weibo search censorship

A technique in our report, which makes many of our methods possible, is to exploit how the censorship system has a different, typically more naive understanding of a search query than the search platform proper. For instance, when testing Baidu Zhidao, in order to guarantee a nonzero number of search results in the absence of censorship, we test a query by combining two predicates: the presence of a common word and the absence of a sensitive keyword, e.g., “the -(xi jinping)”. While an intuitive understanding of this query might suggest that excluding sensitive content should not subject the query to censorship, the censorship filter nevertheless acts on a different level of understanding, by simply scanning the query string for the presence of sensitive keywords. In the remainder of this appendix, we present a proof-of-concept exploitation of the same gap in understanding to demonstrate a method of evading Weibo’s search censorship, which can be used by censorship researchers to determine “ground truth” search results as well as to motivate other techniques for evading search censorship on Weibo and platforms at large.

The technique we present is simply to put an underscore (_) between at least one pair of (or all of) the Chinese characters in a censored keyword. For instance, at the time of this writing, “法轮” [Falun] is hard censored by Weibo search. However, “法_轮” returns results for posts containing “法轮”, whereas separating the characters with a space (“法␣轮”) merely returns results for posts containing “法” and “轮”, although not necessarily adjacently. Placing underscores evades censorship because the search platform’s query parser seemingly removes underscores between Chinese characters before evaluating the query, whereas the censorship filter performs no such removal before scanning the search query for sensitive content.
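The transformation is mechanical enough to automate. The following is a minimal sketch assuming only the behaviour described above (underscores between Chinese characters are stripped by the query parser but not by the censorship filter); it is not a tested Weibo client.

```python
# Insert an underscore between every pair of adjacent Chinese characters so
# that the censorship filter no longer sees the contiguous sensitive keyword,
# while the query parser still evaluates the original keyword.
import re

CJK = r"[\u4e00-\u9fff]"  # basic CJK Unified Ideographs block

def evade(query: str) -> str:
    """Return the query with underscores inserted between adjacent Chinese characters."""
    return re.sub(f"({CJK})(?={CJK})", r"\1_", query)

if __name__ == "__main__":
    print(evade("法轮"))    # 法_轮
    print(evade("习近平"))  # 习_近_平
```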

Using underscores in this manner to evade Weibo search censorship appears to have some limitations. First, we have not observed this evasion to work with English letters, and so it only applies to evading censored keyword combinations which contain at least one keyword containing at least two Chinese characters. Moreover, we have not observed this evasion to work when searching by hashtag, as the query parser does not appear to silently remove underscores in such searches. Finally, in addition to search queries, posts themselves on Weibo may also be subject to automated deletion based on the presence of sensitive combinations of keywords. Thus, when searching for a keyword or combination of keywords that is simultaneously banned from appearing in posts, bypassing the search censorship will still yield zero results, since any matching posts have already been deleted and there is nothing left to return. While the first two limitations discussed in this paragraph may be overcome by further development of this evasion technique, any method for evading search censorship rules will still be limited to only returning results which were not also subject to content deletion rules.

]]>
Bada Bing, Bada Boom: Microsoft Bing’s Chinese Political Censorship of Autosuggestions in North America https://citizenlab.ca/2022/05/bada-bing-bada-boom-microsoft-bings-chinese-political-censorship-autosuggestions-north-america/ Thu, 19 May 2022 13:00:50 +0000 https://citizenlab.ca/?p=78364 Key Findings
  • We analyzed Microsoft Bing’s autosuggestion system for censorship of the names of individuals, finding that, after names relating to eroticism, the second largest category of names censored from appearing in autosuggestions was that of Chinese party leaders, dissidents, and other persons considered politically sensitive in China.
  • We consistently found that Bing censors politically sensitive Chinese names over time, that this censorship spans multiple Chinese political topics and at least two languages, English and Chinese, and that it applies to different world regions, including China, the United States, and Canada.
  • Using statistical techniques, we rule out the possibility that politically sensitive Chinese names are censored in the United States purely through random chance. Rather, their censorship must be the result of a process disproportionately targeting names which are politically sensitive in China.
  • Bing’s Chinese political autosuggestion censorship applies not only to its Web search but also to the search built into Microsoft Windows as well as to DuckDuckGo, which uses Bing autosuggestion data.
  • Aside from Bing’s Chinese political censorship, many names also suffer from collateral censorship, such as those of Dick Cheney or others named Dick.

Introduction

Companies providing Internet services in China are held accountable for the content published on their products and are expected to invest in technology and human resources to censor content. However, as China’s economy expands, more Chinese companies are growing into markets beyond China, and, likewise, the Chinese market itself has become a significant source of profit for international companies. Companies operating Internet platforms with users inside and outside of China increasingly face the dilemma of appeasing Chinese regulators while providing content without politically motivated censorship for users outside of China. Such companies adopt different approaches to meeting the expectations of international users while following strict regulations in China.

Some companies such as Facebook and Twitter do not presently comply with Chinese regulations, and their platforms are blocked by China’s national firewall. Other companies operate their platforms in China but fragment their user bases. For instance, Chinese tech giant ByteDance operates Douyin inside of China and TikTok outside of China, subjecting Douyin users to Chinese laws and regulations, while TikTok is blocked by the national firewall. Users of one fragment of the platform are not able to interact with users in the other. Finally, companies can combine user bases but only subject some communications to censorship and surveillance. Tencent’s WeChat implements censorship policies only on accounts registered to mainland Chinese phone numbers, and, until 2013, Microsoft’s Skype partnered with Hong Kong-based TOM Group to provide a version of Skype for the Chinese market that included censorship and surveillance of text messages. Platforms with combined user bases often provide users with limited transparency over whether their communications have been subjected to censorship and surveillance due to Chinese regulations.

Previous research has demonstrated a growing number of companies that have either accidentally or intentionally enabled censorship and surveillance capacities designed for China-based services on users outside of China. Our analysis of Apple’s filtering of product engravings, for instance, shows that Apple censors political content in mainland China and that this censorship is also present for users in Hong Kong and Taiwan despite there being no written legal requirement for Apple to do so. While WeChat only implements censorship on mainland Chinese users, we found that communications made on the platform entirely among non-Chinese accounts were subject to content surveillance which was used to train and build up WeChat’s political censorship system in China. TikTok has reportedly censored content posted by American users which was critical of the Chinese government. Zoom (an American-owned company based in California) worked with the Chinese government to terminate the accounts of US-based users and disrupt video calls about the 1989 Tiananmen Square Massacre.

In the remainder of this report, we analyze Microsoft Bing’s autosuggestion system for censorship of people’s names. We chose to test people’s names since individuals can represent highly sensitive or controversial issues and because, unlike more abstract concepts, names can be easily enumerated into lists and tested. We begin by providing background on how search autosuggestions work and their significance. We then set out an experimental methodology for measuring the censorship of people’s names in Bing’s autosuggestions and explain our experimental setup. We then describe the results of this experiment, which were that the names Bing censors in autosuggestions were primarily related to eroticism or Chinese political sensitivity, including for users in North America. We then discuss the consequences of these findings as well as hypothesize why Bing subjects North American users to Chinese political censorship.

Background

In this section, we provide background on search autosuggestions and their significance as well as discuss Microsoft Bing and Microsoft’s history in China.

Search engine autosuggestions

Search engines play an important role in distributing content and shaping how the public perceives certain issues. Previous studies have analyzed algorithmic biases and subtle censorship implemented by Baidu in China and Yandex in Russia, each favoring pro-regime and pro-establishment results via source bias and reference bias.

Figure 1: Left, Baidu censoring mention of “Xi Jinping” in autosuggestions for “xi” followed by a space; right, Baidu censoring all autosuggestions for “xi jin”.

In addition to displaying search results, search engines also implement autosuggestion (sometimes called autofill or autocomplete) functionality. Autosuggestions correct user typos, guide and suggest search queries, and often contain answers to a user’s question in themselves. Accordingly, autosuggestions play an important role in informing the user. For example, recent reports on COVID-19 misinformation found that online autosuggestion results influence users’ exposure to medical misinformation. These studies collectively demonstrate that search engines can potentially be “architecturally altered” to serve a particular political, social, or commercial purpose by controlling not only what users are able to see in search results but also the search phrases users might enter in the first place.

Communicating via autosuggestions

While autosuggestion systems can be thought of as a means for users to quickly obtain information, the communication in these systems is usually not one-way. Microsoft researchers have previously noted that “[a]utosuggestion systems are typically designed to predict the most likely intended queries given the user’s partially typed query, where the predictions are primarily based on frequent queries mined from the search engine’s query logs” and that, “[s]ince the suggestions are derived from search logs, they can, as a result, be directly influenced by the search activities of the search engine’s users.” As a result, Google has repeatedly struggled to keep its autosuggestions free of hate speech.

Autosuggestion features are known to be under censorship in China. For example, Baidu is known to filter autosuggestions relating to sensitive topics (see Figure 1). This practice is consistent with the general information control regime in China, which requires all Internet communications to be subject to political censorship. In our previous work, we have found a wide range of user content subject to censorship, including messages, group chat, file contents, usernames, mood indications, and user profile descriptions. As autosuggestions are based on users’ historical searches, they are the result of users’ input and thus are required to be moderated and censored for prohibited content in China.

Microsoft Bing

In 1998, Microsoft launched Bing’s predecessor, MSN Search. After transitioning through multiple name changes, Microsoft rebranded the search engine as Bing in 2009 and finally as Microsoft Bing in 2020.

While Bing’s market share varies regionally, as of March 2022, Bing is used by 6.6 percent of Web users in the United States, 5.4 percent of users in Canada, and 6.7 percent of users in China. While Google is the most popular search engine in North America, Baidu is Bing’s primary competitor in China.

In addition to Web usage, Bing also sees usage through its integration into multiple Microsoft products and through other search engines which use its data. Since Windows 8.1, Bing has been built into the Windows start menu, providing autosuggestions and search results for queries searched using the Windows start menu search functionality. Bing is also the default search engine in Microsoft Edge, Microsoft’s cross-platform, Chromium-based Web browser, providing both autosuggestions and search results for queries typed into the browser’s search bar. Finally, Bing provides autosuggestion and search result data for other search engines, including DuckDuckGo and Yahoo.

Microsoft in China

Entering the Chinese market in 1992, Microsoft established an early presence in China that long preceded the Chinese debut of its Internet search engine. From computer operating systems to gaming consoles and communications platforms, the American company has invested in multiple technology sectors and has been largely successful in China despite the country’s restrictive regulatory environment. It is unclear exactly how much of Microsoft’s global revenue the Chinese market accounts for, as the company has kept the figure a secret for years, with estimates ranging from as low as around one percent to as high as 10 percent. It is clear, however, that Microsoft continues to expand in China, and its relations with Chinese regulators appear relatively stronger than those of many other American tech companies despite some rough interactions.

One of the reasons for Microsoft’s continued success in China might be its implementation of censorship in response to China’s content regulations. Before Microsoft announced in October 2021 that it would pull LinkedIn from the Chinese market, citing a “challenging operating environment,” LinkedIn was found to censor posts or personal profiles considered sensitive to the Chinese government. Similarly, Microsoft has censored results on Bing in China since 2009. However, the company is under growing scrutiny concerning whether it will expand censorship of Chinese politically sensitive content beyond China to advance its commercial interests. In 2021, Bing was found to censor image results for the query “tank man” in the United States and elsewhere around the anniversary of the 1989 Tiananmen Square Movement. Microsoft said the blocking was “due to an accidental human error,” dismissing concerns about possible censorship beyond China. In December 2021, the Chinese government suspended Bing’s search autosuggestions for users in China for 30 days.

Methodology

Our analysis found that, other than the search query that the user has typed so far, at least three variables affect the autosuggestions provided by Bing: the user’s region setting, language setting, and geolocation as determined by the user’s IP address. For purposes of measuring censorship of popularly searched names of individuals, we found it primarily relevant whether a user’s IP address is inside or outside of mainland China. In the remainder of this report, we will call a combination of (1) a region, (2) a language, and (3) whether one’s geolocation is inside mainland China a locale. Outside of geolocation, the other aspects of a locale can be easily set using Bing’s Web UI or by manually setting URL parameters. For instance, to switch Bing to the “en-US” (United States) region and “fr” (French) language, users can visit the following URL:

https://www.bing.com/?mkt=en-US&setlang=fr

We found that these settings affect the entire browsing session, and so setting them will affect other browser tabs and windows, unless those tabs or windows were specifically created in a separate browsing session. We set these URL parameters to automatically test different Bing regions and languages.
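As a concrete illustration, the following is a minimal sketch, not the harness used for this report, of building such locale-pinned URLs from the mkt and setlang parameters described above.

```python
# Construct Bing URLs that pin the region (mkt) and language (setlang) of a
# browsing session, following the URL parameter example given in the text.
from urllib.parse import urlencode

BASE = "https://www.bing.com/"

REGIONS = ["en-US", "en-CA", "zh-CN"]   # regions tested in this report
LANGUAGES = ["en", "zh-hans"]           # English and simplified Chinese

def locale_url(region: str, language: str) -> str:
    """Return a Bing URL that selects the given region and language."""
    return BASE + "?" + urlencode({"mkt": region, "setlang": language})

if __name__ == "__main__":
    for region in REGIONS:
        for language in LANGUAGES:
            print(locale_url(region, language))
```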

To test a variety of locales, we chose a subset of the regions documented by Microsoft in Bing’s API documentation, namely: “en-US” (United States), “en-CA” (Canada), and “zh-CN” (mainland China). For each region tested, we tested two different languages, English (“en”) and simplified Chinese (“zh-hans”). Geolocation as determined by IP address can affect autosuggestions provided by Bing, such as Bing suggesting local restaurants when searching for dining options. However, while we found that whether one’s IP address was inside or outside of mainland China dramatically affected the level of censorship Bing applied to autosuggestions, we are not otherwise aware of IP address affecting Bing’s autosuggestion filtering. Accordingly, we test from two different networks, a network in North America and a network in mainland China.

Since Bing only allows the mainland China region to be selected when accessing the site from a mainland China IP address, we are not able to test other regions from a mainland China IP address. Thus, instead of testing different regions from this address, we test a feature specific to the mainland China region, whether the search engine is configured to use the 国内版 (domestic version) versus the 国际版 (international version). While we were unable to find documentation clearly elucidating the differences between these versions, we generally found that the domestic version was more likely to interpret English letters as Chinese pinyin whereas the international version generally interpreted English letters as English words. For nomenclature purposes, we consider the “domestic version” and the “international version” to be two different languages of the mainland China region, even though they are not languages per se.

To test for censorship in each locale, we used sample testing. We generated queries to test in each locale using the following method. From English Wikipedia, we extracted the titles of any article meeting all of the following criteria:

  • After stripping parentheticals from its title, the article’s title consisted entirely of English alphabetic characters or spaces.
  • The article received at least 1,000 views during September 2021.
  • The article contained either a “person” or an “officeholder” infobox (English Wikipedia articles generally use a special infobox for political officeholders).

The resulting list of article titles we henceforth refer to as English letter names.
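A minimal sketch of these selection criteria follows. It assumes a hypothetical iterable of article records (title, September 2021 view count, infobox types) already extracted from a Wikipedia dump; the field names here are illustrative rather than part of any Wikipedia API.

```python
# Filter article records down to "English letter names" per the criteria above.
import re
from typing import Iterable, List

PARENTHETICAL = re.compile(r"\s*\([^)]*\)")
ENGLISH_LETTERS_AND_SPACES = re.compile(r"^[A-Za-z ]+$")

def english_letter_names(articles: Iterable[dict]) -> List[str]:
    names = []
    for article in articles:
        title = PARENTHETICAL.sub("", article["title"]).strip()
        if not ENGLISH_LETTERS_AND_SPACES.match(title):
            continue  # not purely English letters and spaces
        if article["views_september_2021"] < 1000:
            continue  # insufficiently popular
        if not {"person", "officeholder"} & set(article["infoboxes"]):
            continue  # not a person or officeholder article
        names.append(title)
    return names
```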

From Chinese Wikipedia, we also extracted the titles of any article meeting all of the following criteria:

  • After stripping interpunct (“·”) symbols and parentheticals from its title, the article’s title consisted entirely of simplified Chinese characters or spaces. (In Chinese, interpuncts are often used to mark the separation of first, last, and other names in names transliterated from other languages.)
  • The article received at least 1,000 views during September 2021.
  • The article contained either a “person” or an “藝人” (artist) infobox (Chinese Wikipedia articles generally use a special infobox for artists such as actors or singers).

The resulting list of article titles we henceforth refer to as Chinese character names.

For each of these names, we generated three queries from the name as follows:

  1. the name minus the last letter (e.g., “Xi␣Jinpin”*)
  2. the name itself (“Xi␣Jinping”)
  3. the name followed by a space (“Xi␣Jinping␣”)

If, for a given name, none of its queries’ autosuggestions contain the original name (“Xi␣Jinping”), including if there were no autosuggestions at all, then we say that the name was suggestionless.

* In this report, we use the “open box” symbol “␣” to unambiguously render spaces when describing Bing test queries. We do this because often spaces appear at the end of our test queries, which might be difficult to display using the traditional space character “ ”. For instance, instead of rendering “xi” followed by a space as “xi ”, we render it as “xi␣” so that it is obvious that a single space character trails “xi” in this query.
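A minimal sketch of the suggestionless test above follows; get_autosuggestions(query, locale) is a hypothetical helper standing in for whatever mechanism retrieves Bing’s autosuggestions for a query in a given locale, not an actual Bing API wrapper.

```python
# Decide whether a name is "suggestionless" in a locale: none of the three
# test queries' autosuggestions contain the original name.
from typing import Callable, List

def queries_for(name: str) -> List[str]:
    """The three test queries: name minus last letter, the name, and name plus a space."""
    return [name[:-1], name, name + " "]

def is_suggestionless(name: str, locale: str,
                      get_autosuggestions: Callable[[str, str], List[str]]) -> bool:
    for query in queries_for(name):
        suggestions = get_autosuggestions(query, locale)
        if any(name.lower() in suggestion.lower() for suggestion in suggestions):
            return False
    return True
```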

Just because a name is suggestionless does not necessarily mean that any censorship is occurring, as it might be a name that is uncommonly searched for using Bing or otherwise too obscure. While we previously restricted our names to only those whose corresponding Wikipedia articles had over 1,000 views in a month, Wikipedia article views are not necessarily a predictor of how often a term is searched on Bing. Thus, to help ensure that a name is sufficiently popular on Bing to justify the conclusion that it is censored, we utilized Bing’s Keyword Research API, which provides search “query volume data” in units called impressions. Bing describes these “impressions” as “based on organic query data from Bing and is raw data, not rounded in any way.”

For each locale, for all suggestionless names, we used the Keyword Research API to determine how many times that name had been searched in that locale’s region. If the name reported at least 35 impressions in the last six months, we concluded that the suggestionless name had been censored in that region. We chose the number 35 qualitatively, as suggestionless names with at least 35 impressions tended to fall into predictable categories such as being related to eroticism, misinformation, or Chinese political sensitivity, whereas names with fewer than 35 impressions more often had no identifiable motivation for being censored and were thus more likely to be suggestionless due to their search unpopularity.
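Continuing the sketch above, the decision rule reduces to the following; get_impressions(name, region) is a hypothetical stand-in for a lookup against Bing’s Keyword Research API.

```python
# A suggestionless name is treated as censored only if it had at least 35
# search-volume impressions in the region over the preceding six months.
IMPRESSION_THRESHOLD = 35

def is_censored(name: str, locale: str, region: str,
                get_autosuggestions, get_impressions) -> bool:
    if not is_suggestionless(name, locale, get_autosuggestions):
        return False
    return get_impressions(name, region) >= IMPRESSION_THRESHOLD
```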

Experimental setup

In our methodology, we describe testing from two different networks: a North American network and a mainland Chinese network. For the North American network, we tested from Toronto, Canada, and, for the Chinese network, we tested from Shaoxing, China. The Toronto testing occurred from a machine hosted on a DigitalOcean network, and the Shaoxing testing was performed using a VPN server provided by a popular VPN service whose Chinese vantage points we had confirmed to be in China and subject to censorship from China’s national firewall. We performed the experiment described above during the week of December 10–17, 2021.

Findings

Performing this experiment, we collected 7,186 Chinese character names from Chinese Wikipedia and 97,698 English letter names from English Wikipedia.

Table 1: Among 7,186 Chinese character names, for each locale, how many were suggestionless and how many of the suggestionless were censored (had at least 35 search volume impressions).
Table 2: Among 97,698 English letter names, for each locale, how many were suggestionless and how many of the suggestionless were censored (had at least 35 search volume impressions).

 

Each locale tested had suggestionless names as well as names that we found to be censored (see Tables 1 and 2). Most of the suggestionless names had little to no search volume and thus were most likely suggestionless due to being insufficiently popular search queries. However, as we explain in our methodology, for each locale, we consider a suggestionless name censored only if its locale had at least 35 Bing search volume impressions in the six months prior to our experiment. Resultantly, we found between 32 and 146 Chinese character names censored in each locale tested and between 162 and 653 English letter names censored in each locale tested.

Since there may exist names that are both targeted by Bing for censorship and that have low search volume, one consequence of our requirement that a name have at least 35 impressions in a region before being considered censored is that we would expect to discover more censored names in regions with more Bing users and thus more search volume. Thus, with our data, we cannot use the absolute number of censored names across different regions to say that one region is more censored than another. To state this another way: Bing may be targeting names for censorship in regions where we were unable to detect that censorship because those names had insufficient search volume there. Thus, if we detect a name censored in one region but not another, we may not have detected the name’s censorship in the other region merely because it had insufficient search volume in that region.

Table 3: For each locale, the number of Chinese character names which have at least 35 impressions in that region and that have no autosuggestions, according to each name’s content category.
Table 4: For each locale, the number of English letter names which have at least 35 impressions in that region and that have no autosuggestions, according to each name’s content category.

To better understand the names we discovered to be censored, we categorized them based on their underlying context (see Tables 3 and 4 for more details). We began by reviewing all of the names and abstracting common themes among them. We then reviewed the names again and categorized them into the common themes that we discovered, which include “Chinese political” (e.g., incumbent and retired Chinese Communist Party leaders, dissidents, political activists, and religious figures), “historical figure” (e.g., ancient philosophers and pre-PRC thinkers), “international politician”, “entertainment” (e.g., singers, celebrities), and “eroticism”. We categorized as “eroticism” anyone who meets or has met any of the following criteria: anyone participating in pornography or its production, glamor models, gravure models, burlesque dancers, drag queens, and anyone who has been a famous victim of a nude photo or sex video leak. As with the other categories, we based our criteria for this category not on our own intuitions but rather on what we found Bing to censor.

Finally, we created two special categories, “collateral” and “overshadowed”. We assigned the “collateral” category to names that do not appear to be directly targeted for censorship but rather appear to be collateral censorship from some other censorship rule. We found that the most common reason for a name being collaterally censored was containing the name “Dick”, e.g., “Dick Cheney”. The “overshadowed” category is similar in that we assign to it names which do not appear targeted for censorship. However, instead of being collaterally censored, we believe that overshadowed names have no autosuggestions because they are overshadowed by the autosuggestions of someone with a similar name. For example, we found that suggestions for the actor “Gordon Ramsey” were overshadowed by suggestions for the more famous celebrity chef “Gordon Ramsay”. However, suggestions for the less famous “Gordon Ramsey” could still be found if we used a more specific query, such as “Gordon Ramsey actor”.

Figure 2: Top, October 2021 autosuggestions for “xi” in the United States English locale which contain no mention of Xi Jinping; bottom, a complete absence of autosuggestions for “xi␣”. Our December 2021 measurement findings were analogous.

In nearly every locale we tested, censored Chinese character names were most likely to belong to the “Chinese political” category, whereas censored English letter names were more likely to belong to the “eroticism” category, although each locale also censored “Chinese political” English letter names such as “Xi Jinping”, “Liu Xiaobo”, and “Tank Man” (see Figure 2 for an example). Overall, when considering all censored Chinese character and English letter names in aggregate, the largest number of names were in the “eroticism” category followed by the “Chinese political” category, excluding the two special “collateral” and “overshadowed” categories.

Already this might seem like compelling evidence that Bing performs Chinese political censorship both inside and outside of China. However, how can we be certain? After all, while many sensitive Chinese political names are censored on Bing, many are not. Moreover, in locales like United States English, there were 11 “Chinese political” English letter names censored – is that even significant? Perhaps these names are censored simply due to some defect in Bing that fails to show suggestions for names uniformly at random. In the following section, we use statistical techniques to answer whether Bing is performing any targeted censorship at all, whether any of the targeting is for names of Chinese political sensitivity, and whether such censorship extends outside of China.

Is there Chinese political censorship in North America?

Up to this point, we have found the names of people who are popularly read about on Wikipedia and searched about on Bing but yet, for some reason, have no Bing autosuggestions. These names appear to often fall into certain categories, such as being associated with eroticism, being politically sensitive in China, or containing certain English swear words in them such as “dick”. Still, how do we really know that these names are being specifically targeted for censorship and that they are not just random failures of Bing’s autosuggestion system? For instance, it could be that, because of the way we selected popular names, we found a lot of names related to eroticism and Chinese politics because those are the names that are most popularly searched for on Bing or read about on Wikipedia.

In this section, we further explore the nature of Bing’s censorship using statistical techniques. Since it is already well known that Bing implements censorship in mainland China to comply with legal requirements, we focus in this section on whether Bing implements Chinese political censorship in North America. Particularly, we look at the United States English locale, as we presume that that locale is the most common locale of users in North America.

Is there any targeted censorship of Chinese character names in the United States?

In our methodology, recall that we select names to test by only looking at sufficiently popular Wikipedia articles about people. We then further filter our results by only considering suggestionless names with a sufficiently high Bing search volume. Thus, in two ways we select for names that are popular. It is therefore possible that, even if the names for which Bing failed to show autosuggestions were not really censored but rather chosen by some random process, we may still see themes such as Chinese political sensitivity and eroticism occur with high frequency in our censored data set because these may simply be popular topics commonly viewed on Wikipedia and searched on Bing.

To test the hypothesis that we may only be seeing these results by chance, we use statistical significance testing. If Bing does not target any types of content for censorship and the names that we call censored result from a uniformly random process, then we would expect the proportions of categories among censored names and non-censored names to be the same. With the censored names already categorized, we chose 120 non-censored names at random using the same thresholds concerning Wikipedia article views and Bing search volume that we applied to the censored names.

Category Non-censored Censored
Chinese political 11 (9.2%) 30 (93.8%)
Entertainment 69 (57.5%) 1 (3.1%)
Eroticism 5 (4.2%) 0 (0.0%)
Historical figure 12 (10.0%) 0 (0.0%)
International politician 3 (2.5%) 1 (3.1%)
Public figure 20 (16.7%) 0 (0.0%)
Total 120 (100.0%) 32 (100.0%)

Table 5: Among Chinese character names, a contingency table for whether a name is censored versus its category.

Just by eyeballing the results of this comparison (see Table 5), we can see that the proportions of categories between censored and non-censored names are radically different. For instance, among the censored names, 93.8 percent are Chinese political, whereas among the non-censored, only 9.2 percent are in that category. Since we chose the non-censored names at random, if Bing were also choosing names to censor at random, we would expect these proportions to be the same.

Nevertheless, using statistical significance testing, we need not rely merely on our intuitions. Using Fisher’s test, we can test this hypothesis, called in statistical significance testing the null hypothesis. Specifically, our null hypothesis under question is that the proportional category sizes in both groups are the same. The result of Fisher’s test is a p value, which, in this case, is the probability that we would see differences in the proportions at least this extreme by chance. Applying Fisher’s test to Table 5, we find that p = 5.35 ⋅ 10⁻¹⁹, confirming our intuition that these differences are not random chance and that Bing is targeting specific categories of content in the United States for censorship. With a p value this small, explanations other than chance, such as our having mistyped the number into this document, become comparatively more likely.

Is there any targeted Chinese political censorship of Chinese character names in the United States?

Above we established that Bing is censoring specific categories of Chinese character names in the United States. However, are they specifically targeting Chinese politically sensitive names for censorship?

Category Non-censored Censored
Chinese political 11 (9.2%) 30 (93.8%)
Not Chinese political 109 (90.8%) 2 (6.2%)
Total 120 (100.0%) 32 (100.0%)

Table 6: Among Chinese character names, a contingency table for whether a name is censored versus whether it is Chinese politically sensitive.

To answer this question, we are concerned with a smaller, 2×2 table containing only two categories, “Chinese political” and “not Chinese political”, the latter being the aggregate of all other categories (see Table 6). While again it seems obvious looking at the data that censored names are disproportionately Chinese political compared to names chosen at random, we nevertheless perform Fisher’s test. In this case, we formulate our null hypothesis to be that the proportion of Chinese political names among censored names is no greater than among non-censored names. We find that p = 2.61 ⋅ 10⁻²⁰, all but confirming that Bing is targeting Chinese politically sensitive names in the United States for censorship.
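For illustration, a one-sided Fisher’s exact test on the 2×2 table above can be run as follows; this is a minimal sketch using scipy, which is not necessarily the tooling used for the report, and the exact p value depends on the test’s orientation.

```python
# Fisher's exact test on Table 6: is being censored associated with being
# Chinese politically sensitive more than chance would allow?
from scipy.stats import fisher_exact

#        Chinese political  Not Chinese political
table = [[30, 2],    # censored
         [11, 109]]  # non-censored

odds_ratio, p_value = fisher_exact(table, alternative="greater")
print(p_value)  # vanishingly small, on the order of the value reported above
```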

Is there any targeted censorship of English letter names in the United States?

In the previous section, we used statistical techniques to test whether Chinese character names such as “张高丽” (Zhang Gaoli) were disproportionately targeted for censorship in the United States or whether their lack of autosuggestions might somehow be the result of random failure. In this section, we will apply the same techniques to test this question with respect to censored English letter names such as “Xi Jinping”.

While we previously categorized English letter names censored for containing words like “Dick” as “collateral” censorship and names overshadowed by much more famous people as “overshadowed”, we have no way of making such categorizations in the non-censored group. Thus, for purposes of our statistical testing, we recategorize such names according to how we would categorize them a priori, i.e., as we would if we did not know whether they were censored, into ordinary categories such as “entertainment” and so on.

Category Non-censored Censored
Chinese political 1 (0.8%) 11 (1.7%)
Entertainment 73 (60.8%) 110 (17.2%)
Eroticism 0 (0.0%) 436 (68.0%)
Historical figure 5 (4.2%) 12 (1.9%)
International politician 10 (8.3%) 12 (1.9%)
Public figure 31 (25.8%) 60 (9.4%)
Total 120 (100.0%) 641 (100.0%)

Table 7: Among English letter names, a contingency table for whether a name is censored versus its category. Since names categorized as “collateral” censorship and “overshadowed” are recategorized a priori, the proportion of censored names in categories such as “entertainment” and “public figure” is much higher than in Table 5.

Just by eyeballing the proportions between censored and non-censored names (see Table 7), we can again see that the proportions of categories between censored and non-censored names are radically different. For instance, among the censored names, 68.0 percent are related to eroticism, whereas among the non-censored, zero are. Using Fisher’s test, we again test whether the proportions between censored and non-censored are the same, finding a probability of p = 4.64 ⋅ 10⁻⁵¹ that we would see such extreme differences in proportions by chance if the proportions were in fact the same.

Category Non-censored Censored
Eroticism 0 (0.0%) 436 (68.0%)
Not eroticism 120 (100.0%) 205 (32.0%)
Total 120 (100.0%) 641 (100.0%)

Table 8: Among English letter names, a contingency table for whether a name is censored versus whether it is eroticism.

Due to the large differences in proportions in the “eroticism” category, in Table 8, we divide the names into “eroticism” and “not eroticism” categories using the same method as we did to construct Table 6. While again it seems obvious looking at the data that censored names are disproportionately associated with eroticism compared to names chosen at random, we nevertheless perform Fisher’s test. Our null hypothesis is that the proportion of eroticism names among censored names is no greater than among non-censored names. We find that the probability of seeing this many censored eroticism names by chance is only p = 9.50 ⋅ 10⁻⁵², all but confirming that Bing is targeting eroticism-related English letter names for censorship in the United States.

Is there any targeted Chinese political censorship of English letter names in the United States?

Although thus far we have used statistical hypothesis testing in cases where the data already lent itself to an obvious conclusion, we will now explore a question for which eyeballing the data is less likely to yield a confident answer. Namely, with the assistance of statistical hypothesis testing, we will investigate whether there is Chinese political censorship of English letter names in the United States. Looking at Table 7, it is not obvious.

Category Non-censored Censored
Chinese political 1 (0.8%) 11 (1.7%)
Not Chinese political 119 (99.2%) 630 (98.3%)
Total 120 (100.0%) 641 (100.0%)

Table 9: Among English letter names, a contingency table for whether a name is censored versus whether it is Chinese politically sensitive.

Table 9 is the resulting 2×2 contingency table comparing whether a name is censored versus whether it is Chinese politically sensitive. We formulate our null hypothesis to be that the proportion of Chinese political names among censored names is no greater than among non-censored names. Applying Fisher’s test, we find that p = 0.412, which is an inconclusive result. Looking at the table, this result is not too surprising: even though the proportion of censored Chinese political names (1.7 percent) is over twice the proportion of non-censored Chinese political names (0.8 percent), the absolute number of Chinese political names in both the non-censored and censored columns is small, and thus the test lacks statistical power. In the following section, we perform a different test looking only at Chinese pinyin names, which achieves a more conclusive result concerning whether Bing performs Chinese political censorship of English letter names in the United States.

Is there any targeted Chinese political censorship of pinyin names in the United States?

In our tests of English letter names, there exists one possible confounding variable that we have yet to consider. One might argue that, even if Bing’s censorship of “Chinese political” names were the result of a random process, such a process might disproportionately affect foreign names such as names written in Chinese pinyin, especially if Bing’s autosuggestion algorithms were more poorly suited to such names. While not all English letter names categorized as “Chinese political” were pinyin (specifically, Tank Man, Gedhun Choekyi Nyima, and Rebiya Kadeer), the remaining eight were (e.g., Xi Jinping, Li Wenliang, etc.). Thus, what appears to be Chinese political censorship of English letter names might instead reflect Bing simply being bad at providing autosuggestions for pinyin names rather than censoring names politically sensitive in China.

Category Non-censored Censored
Chinese political 18 (15.0%) 8 (100.0%)
Entertainment 64 (53.3%) 0 (0.0%)
Historical figure 12 (10.0%) 0 (0.0%)
Public figure 26 (21.7%) 0 (0.0%)
Total 120 (100.0%) 8 (100.0%)

Table 10: Among pinyin names, a contingency table for whether a name is censored versus its category.

To rule out this hypothesis, we compare the eight censored pinyin names to 120 non-censored pinyin names, finding that all eight (100 percent) of the censored names are “Chinese political” versus only 15 percent of the non-censored names (see Table 10). Although there is a large difference in proportions in the Chinese political category, with the censored proportion being 6.67 times the non-censored one, we are estimating the censored Chinese political proportion from only eight censored pinyin names. Thus, despite the large difference in proportions, it may not be intuitively obvious whether there is a statistically significant difference between the two. However, Fisher’s test already accounts for the absolute numbers of samples as part of its calculus, so we can be confident in its results without making any additional consideration for the sample sizes.

Category Non-censored Censored
Chinese political 18 (15.0%) 8 (100.0%)
Not Chinese political 102 (85.0%) 0 (0.0%)
Total 120 (100.0%) 8 (100.0%)

Table 11: Among Pinyin names, a contingency table for whether a name is censored versus whether it is Chinese politically sensitive.

Applying Fisher’s test to Table 11 under the null hypothesis that the proportion of Chinese political names among censored names is no greater than among non-censored names, we find that p = 1.09 ⋅ 10⁻⁶. Even though our choice to look only at pinyin names was initially motivated by eliminating a confounding variable, the greater power of this test also allowed us to achieve a conclusive result, almost certainly showing that Bing targets English letter names for Chinese political censorship in the United States just as it similarly targets simplified Chinese character names there.

Content analysis

Across the three regions we tested (i.e., mainland China, the United States, and Canada), we observed overwhelming censorship of Chinese character names relating to Chinese politics. These are predominantly the names of top-level Chinese government leaders and party figures, including incumbent leaders (e.g., 习近平, “Xi Jinping”), retired officials (e.g., 温家宝, “Wen Jiabao”, a former Chinese Premier), historical figures (e.g., 李大钊, “Li Dazhao,” a co-founder of the Chinese Communist Party), and party leaders involved in political scandals or power struggles (e.g., 周永康, “Zhou Yongkang,” a former Party leader).

United States English locale Canada English locale
# Name Translation Category Name Translation Category
1 张高丽 Zhang Gaoli Chinese political 张高丽 Zhang Gaoli Chinese political
2 江泽民 Jiang Zemin Chinese political 习近平 Xi Jinping Chinese political
3 王岐山 Wang Qishan Chinese political 江泽民 Jiang Zemin Chinese political
4 胡锦涛 Hu Jintao Chinese political 傅政华 Fu Zhenghua Chinese political
5 周永康 Zhou Yongkang Chinese political 陈独秀 Chen Duxiu Chinese political
6 曾庆红 Zeng Qinghong Chinese political 王岐山 Wang Qishan Chinese political
7 汪洋 Wang Yang Chinese political 薄熙来 Bo Xilai Chinese political
8 赵紫阳 Zhao Ziyang Chinese political 李大钊 Li Dazhao Chinese political
9 胡春华 Hu Chunhua Chinese political 桃乃木香奈 Kana Momonogi Eroticism
10 王沪宁 Wang Huning Chinese political 林彪 Lin Biao Chinese political

Table 12: Ordered by decreasing search volume, the top 10 Chinese-character names that are censored in the United States English and Canada English locales.

The censorship of Chinese leaders’ names in the domestic and international versions of Bing in China may be due to Microsoft’s compliance with Chinese laws and regulations. However, there is no legal reason for the names to be censored in Bing autosuggestions in the United States and Canada. In Table 12, we highlight the top 10 highest search volume names censored in each of these two North American regions. Although the two regions do not appear to censor exactly the same names, most of the names on both lists refer to Chinese political figures. In the United States English locale, all of the top 10 names reference incumbent or recently retired Chinese politicians. In the Canada English locale, three of the top 10 names reference people related to the history of the Chinese Communist Party, such as its founding members; interestingly, “桃乃木香奈” (Kana Momonogi), the name of a Japanese pornographic actress, also appears on the list.

During our data collection period, former Chinese Vice Premier Zhang Gaoli received the highest search volume on Bing among the names we found censored in the United States and Canada English locales. The high international search volume for Zhang, a retired Chinese politician, is likely due to a scandal in which he is alleged to have sexually assaulted Chinese tennis star Peng Shuai. Peng first published her allegations on Weibo on November 2, 2021, and they were quickly censored on all Chinese platforms. She then disappeared from the public eye for almost three weeks, prompting worldwide media attention on Zhang as well as an international campaign calling for information about Peng’s whereabouts.

United States English locale Canada English locale
# Name Category Name Category
1 Riley Reid Eroticism Mia Khalifa Eroticism
2 Brandi Love Eroticism Brandi Love Eroticism
3 Mia Khalifa Eroticism XXXTentacion Collateral (“XXX”)
4 XXXTentacion Collateral (“XXX”) Adriana Chechik Eroticism
5 Mia Malkova Eroticism Nina Hartley Eroticism
6 Dick Van Dyke Collateral (“Dick”) Kendra Lust Eroticism
7 Xi Jinping Chinese political Jenna Jameson Eroticism
8 Nina Hartley Eroticism Dick Van Dyke Collateral (“Dick”)
9 Adriana Chechik Eroticism Julia Ann Eroticism
10 Jenna Jameson Eroticism Asa Akira Eroticism

Table 13: Ordered by decreasing search volume, the top 10 English letter names that are censored in the United States English and Canada English locales.

Regarding English letter names, we found that most censored names were related to eroticism in each locale that we tested (see Table 13). Even though Bing’s censorship of eroticism may stem from concern over explicit content, such censorship nevertheless has impacts beyond sexual content. For instance, we found that Bing censors the name “Ilona Staller” in all regions we tested. Ilona Staller appears to be a former Hungarian-Italian porn star who turned to politics and has run for office in Italy since 1979. Censoring autosuggestions containing this name may affect her political career as well.

Notably, not only were autosuggestions containing politically sensitive Chinese character names censored, but names of Chinese leaders, dissidents, political activists, and religious figures written in English letters were also censored in the United States English and Canada English locales. Whereas Chinese political censorship of Chinese character names in these locales pertains predominantly to the names of incumbent, retired, and historical Chinese Communist Party leaders, Chinese political censorship of English letter names appears to have greater variety, some of which relates closely to current events.

United States English locale
# Name Note
1 Xi Jinping Incumbent Chinese president
2 Tank Man Nickname of an unidentified Chinese man who stood in front of a column of tanks leaving Tiananmen Square in Beijing on June 5, 1989
3 Li Wenliang Chinese ophthalmologist who warned his colleagues about early COVID-19 infections in Wuhan
4 Jiang Zemin Former Chinese president
5 Guo Wengui Exiled Chinese billionaire businessman
6 Liu Xiaobo Deceased Chinese human rights activist and Nobel Peace Prize awardee
7 Gedhun Choekyi Nyima The 11th Panchen Lama belonging to the Gelugpa school of Tibetan Buddhism, as recognized and announced by the 14th Dalai Lama on 14 May 1995
8 Li Hongzhi Founder and leader of Falun Gong
9 Li Yuanchao Former Chinese vice president
10 Rebiya Kadeer Uyghur businesswoman and political activist
11 Chai Ling One of the student leaders in the Tiananmen Square protests of 1989

Table 14: Ordered by decreasing search volume, each of “Chinese political” category English letter names censored in the United States English locale.

Table 14 shows the 11 “Chinese political” category names censored in the United States English locale by descending search volume. One of the highest volume censored Chinese political names was that of the late Chinese doctor Li Wenliang. Dr. Li warned his colleagues about early COVID-19 infections in Wuhan but was later forced by local police and medical officials to sign a statement denouncing his warning as unfounded and illegal rumors. Dr. Li died from COVID-19 in early February 2020. References to Dr. Li have been regularly censored on mainstream Chinese social media platforms including WeChat.

In the United States English locale, we also found three names of influential figures spreading COVID-19 misinformation and anti-vaccine messages: Ali Alexander, Pamela Geller, and Sayer Ji. These names are too few in number to generalize about whether and to what extent Bing targets COVID-19 misinformation for censorship. However, Microsoft publicly acknowledges that Bing, like many other Internet operators, applies “algorithmic defenses to help promote reliable information about COVID-19.” Such control of information has also proven controversial, as Internet operators attempt to balance the facilitation of free speech with controlling potentially deadly misinformation concerning COVID-19 and its prevention and treatment.

Cross-region comparison

Previously we have only looked at Bing’s autosuggestion censorship in a few specifically chosen locales. To understand how Bing’s autosuggestions vary across the world, we expanded our testing to each region documented by Microsoft in Bing’s API documentation, although we found in our testing that the documented “no-NO” (Norway) region was invalid, and so we replaced it with the valid “nb-NO” region. For each of these regions, we tested the default language for that region as obtained by not manually choosing any language for that region—for instance, the default language of Japan (“ja-JP”) when no language is manually chosen is Japanese (“ja”). While in this experiment we tested a greater number of regions and a greater number of languages, we tested a more limited set of queries compared to before. Specifically, in each locale, we tested every name that we found to be censored in any locale in our earlier experiment above. Overall, our test set consisted of 1,178 English letter names and 342 Chinese character names across 41 locales.

To compare results across locales, we used hierarchical clustering. Hierarchical clustering is a method of clustering observations into a hierarchy or tree according to their similarity. This tree can be flattened into an ordered list in which similar items tend to appear near one another.

Hierarchical clustering requires some metric to measure the similarity or, more specifically, the dissimilarity between observations. For both English letter and Chinese character names, we hierarchically cluster according to the following dissimilarity metric. For any name query in a locale, Bing provides an ordered list of between zero and eight autosuggestions, which we treat as a string whose symbols are individual autosuggestions. To compare two such strings of autosuggestions, we compute their Damerau-Levenshtein distance, a common distance metric used on strings, where we consider two individual autosuggestions equal only if they are identical. Finally, to compare two regions’ autosuggestions for a set of names, we sum the Damerau-Levenshtein distances between their autosuggestion strings for each of these names.
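A minimal sketch of this metric and the clustering step follows. It implements the restricted (optimal string alignment) variant of the Damerau-Levenshtein distance over sequences of autosuggestions and, as a simplification, uses average linkage on the precomputed distance matrix, whereas the figures below were produced with the centroid method.

```python
# Dissimilarity between locales: per-name Damerau-Levenshtein distance over
# autosuggestion sequences, summed across names, then hierarchically clustered.
from itertools import combinations
from typing import Dict, List, Sequence

import numpy as np
from scipy.cluster.hierarchy import linkage

def damerau_levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Restricted edit distance where each 'symbol' is one whole autosuggestion."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

def locale_dissimilarity(x: Dict[str, List[str]], y: Dict[str, List[str]]) -> int:
    """Sum of per-name distances; assumes both locales were queried for the same names."""
    return sum(damerau_levenshtein(x[name], y[name]) for name in x)

def cluster_locales(locales: Dict[str, Dict[str, List[str]]]):
    """Return locale labels and a linkage matrix built from pairwise dissimilarities."""
    labels = sorted(locales)
    condensed = [locale_dissimilarity(locales[a], locales[b])
                 for a, b in combinations(labels, 2)]
    return labels, linkage(np.asarray(condensed, dtype=float), method="average")
```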

In the following two sections, we look at how Bing’s autosuggestions for English letter and simplified Chinese character names vary across locales.

Comparison of English letter names’ suggestions

In this section, by using hierarchical clustering, we look at how Bing’s autosuggestions for English letter names, including their censorship, vary across 41 different locales.

Figure 3: The distance between each locale’s autosuggestions for English letter names hierarchically clustered according to the centroid method.

As we might expect, we found autosuggestions largely clustered around both geography and language (see Figure 3). Beginning at the top left, we find a cluster of East Asian locales (“zh-CN Intl.” through “ja-JP”) identified by a yellow square. Next, below and to the right, we find a large cluster of European-language but non-English locales (“fr-CH” through “pt-BR”) spanning Europe and Latin America in a large yellow square. Further below and to the right, we find two small clusters of English-speaking locales (“en-CA” through “en-GB” and “en-ZA” through “en-ID”). It is unclear why these locales form two clusters and not one. For instance, both clusters include British Commonwealth nations and both include Eastern Hemisphere nations. However, as a whole, it is not surprising that English-language locales would generally cluster in their autosuggestions for English letter names. Finally and most noteworthy, in the bottom right corner is a cluster containing the United States (“en-US”) and the China international version as accessed from both inside mainland China (“zh-CN Intl. VPN”) and outside (“zh-CN Intl.”). This cluster is remarkable not only in how similar the United States’ autosuggestions are to those of the China international version but also in how different they are from every other locale, as is evident from the dark blue bars across the top side and across the left side of the plot.

Given the large number of United States Bing users, the fact that Microsoft is based in the United States, and the United States’ status as an Internet hegemon, it is perhaps not surprising to see its autosuggestions differ from those of other locales. However, what is less clear is why the United States’ autosuggestions are so similar to China’s, including in their Chinese political censorship. Since the international version of Bing’s China search engine was developed as an English-language search engine for Chinese users, we might imagine it to be a thin wrapper around the United States search engine; what is less obvious is how Chinese politically motivated censorship is moving in the other direction, from the international China search engine to the United States.

Comparison of Chinese character names’ suggestions

In this section, we apply hierarchical clustering toward looking at how Bing’s autosuggestions for Chinese character names, including their censorship, vary across 41 different locales.

Figure 4: The distance between each locale’s autosuggestions for Chinese character names hierarchically clustered according to the centroid method.

As with autosuggestions for English letter names, we found Chinese character names’ autosuggestions clustered around both geography and language (see Figure 4). Beginning again at the top left, we find a large cluster of non-East-Asian language locales (“en-GB” through “en-IN”). As, outside of East Asia, Chinese characters are primarily used only by native Chinese speakers, it may be unsurprising that these regions form such a large cluster. Next, moving toward the bottom right, we find South Korea (“ko-KR”) in a cluster by itself, China’s domestic and international versions as accessed from mainland China (“zh-CN Dom. VPN” and “zh-CN Intl. VPN”) in a cluster, and then Japan (“ja-JP”), Taiwan (“zh-TW”), and Hong Kong (“zh-HK”) each in their own singleton clusters. Finally, in the bottom right corner, we have a three-member cluster containing the United States (“en-US”) and both China’s domestic and international versions as accessed from outside of China (“zh-CN Dom.” and “zh-CN Intl.”).

Unlike with English letter names, where we might imagine that Bing is reusing the United States’ autosuggestions to implement the English-language international version of its China search engine, it is unclear why, with Chinese character names, the United States also shares a large number of autosuggestions with mainland China’s domestic search engine. However, we believe that the answer to this question might play a part in answering how the United States experiences Chinese political censorship of Chinese character names.

Glitches in the matrix

In analyzing Bing’s censorship of autosuggestions across different locales, in addition to the name censorship that we described above, we also encountered other strange anomalies. We describe a few here to help characterize the inconsistency we often observed in Bing’s censorship and in the hope that such descriptions may be otherwise helpful for understanding Bing’s autosuggestion censorship system.

Jeff Widener

For some names, we found evidence of censorship, even though they did not meet all of our criteria to be considered censored. For instance, consider photographer Jeff Widener, who is famous for his photos of the June 4 Tiananmen Square protests.

“Jeff␣Widene”: (no autosuggestions)
“Jeff␣Widener”: (no autosuggestions)
“Jeff␣Widener␣”: jeff widener ou center for spatial analysis; jeff widener pics; jeff widener photography

Table 15: Autosuggestions for three queries for Jeff Widener in the United States English language locale, one missing the last letter, one his full name and nothing more, and one his full name followed by a space.

Using our stringent criteria, we do not consider his name censored in the United States English language locale, as our criteria require that three specific queries fail to autosuggest his name, whereas in our testing only two failed to autosuggest him (see Table 15 for details). Nevertheless, his name shows signs of censorship, as two of the three queries failed to provide any autosuggestions, and it was only by adding a space to the end of his name that we were finally able to see autosuggestions. This observation is especially curious as one would generally expect fewer suggestions as one types, since typing more can exclude autosuggestions if they do not begin with the inputted text.

Looking at autosuggestions for his name in other locales, his name fails to provide any autosuggestions when queried from a mainland China IP address, whereas in the Canada English language locale, there are autosuggestions for his name in all three of the tests in Table 15. Thus, we suspect that names such as his are for some reason being filtered only on some inputs and not others in the United States English language region. The reasons for this inconsistent filtering are not clear. However, we suspect that, if we understood them, then they may shed light on why Bing censors autosuggestions in the United States and Canada for Chinese political sensitivity at all.
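For reference, a minimal sketch of this three-probe test is shown below. The helper `get_autosuggestions(query, locale)` is hypothetical, standing in for however one collects the suggestions Bing displays; the three probe variants follow the description in Table 15. Under this definition, a name like Jeff Widener’s is not counted as censored so long as any one probe, such as the trailing-space probe, still returns suggestions containing the name.

```python
# A minimal sketch of the three-query censorship test, assuming a hypothetical
# helper `get_autosuggestions(query, locale)` that returns the list of
# suggestions Bing shows for `query` in `locale`.
def probe_queries(name: str) -> list[str]:
    # The three probes per name: missing the last letter, the full name,
    # and the full name followed by a space.
    return [name[:-1], name, name + " "]


def appears_censored(name: str, locale: str) -> bool:
    # Under the stringent criteria described above, a name only counts as
    # censored if none of the three probes autosuggests the name itself.
    for query in probe_queries(name):
        suggestions = get_autosuggestions(query, locale)  # hypothetical helper
        if any(name.lower() in s.lower() for s in suggestions):
            return False
    return True
```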

Hillary Clinton and Alex Jones

In many locales, including the United States English language locale, the queries “alex␣jone”, “alex␣jones”, “hillary␣clinto”, and “hillary␣clinton” also show signs of censorship following a pattern resembling Jeff Widener. However, instead of these queries being completely censored, we found that their autosuggestions only contained their name immediately followed by punctuation such as “hillary clinton’s eyes” or “alex jones+infowars” (see Figure 5).

Figure 5: November 2021 autosuggestions for “alex␣jones” “hillary␣clinton” in the United States English locale never contain their bare names nor their names followed by spaces.

One might imagine that Bing could be using some poorly written regular expression which unintentionally fails to filter their names when followed by certain punctuation (e.g., by checking whether /(^|␣)hillary␣clinton(␣|$)/ matches). Curiously, as with Jeff Widener, we found that when their names are followed by spaces (i.e., “alex␣jones␣”, “hillary␣clinton␣”), we see autosuggestions consistent with our expectations.
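As an illustration of this hypothesis (and only a hypothesis, not anything confirmed about Bing’s internals), the sketch below shows how a filter anchored to spaces or string boundaries (the ␣ in the pattern above denotes a space) would remove the bare name and space-delimited suggestions while letting punctuation-delimited ones through.

```python
# A speculative sketch of the hypothesized filter; it is not Bing's code.
import re

# Anchored to a space or string boundary, the pattern misses suggestions in
# which the name is immediately followed by punctuation.
pattern = re.compile(r"(^| )hillary clinton( |$)")

candidates = [
    "hillary clinton",        # matches  -> filtered out
    "hillary clinton age",    # matches  -> filtered out
    "hillary clinton's age",  # no match -> slips through the filter
    "hillary clinton-age",    # no match -> slips through the filter
]

surviving = [s for s in candidates if not pattern.search(s)]
print(surviving)  # ["hillary clinton's age", 'hillary clinton-age']
```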

Alex Jones may have been targeted for censorship for propagating COVID-19 anti-vaccinationism or other misinformation. It is unclear why Hillary Clinton may be targeted.

Figure 6: Top, October 2021 autosuggestions for “president␣xi␣” in the United States English locale suggested “president xi jing” and “president xi jinping’s” but seemed unable to utter “president xi jinping”; bottom, May 2022 autosuggestions for “mr␣xi␣j”, again only suggesting typos and “jinping” followed by an apostrophe.

We also found that in many regions, including the United States, the query “xi␣jinping” yielded no autosuggestions. However, once an apostrophe or other punctuation symbol is inputted after his name, then Bing displayed autosuggestions (e.g., “xi␣jinping” has no autosuggestions but “xi␣jinping’s” has autosuggestions). Sometimes the system could be coerced into providing an apostrophe-containing autosuggestion without typing one in (see Figure 6).

Other censorship findings

While our report focuses on testing people’s names, some informal testing reveals that other categories of proper nouns politically sensitive in China are also censored in the United States English language locale. Examples include “falun”, a reference to the Falun Gong political and spiritual movement banned in China, as well as “tiananmen” and “june␣fourth”, the place and day of the June 4 Tiananmen Square massacre.

Motivated by the discovery that “june␣fourth” is censored, on May 18, 2022, we performed an experiment testing for the censorship of dates. We tested all 366 possible days of the year written as a month followed by an ordinal number, e.g., “january␣first”, “january␣second”, etc., using a methodology similar to the one above and testing three different queries for each date. We tested the following locales from North America: China international, United States English, and Canada English.
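A minimal sketch of how the 366 date queries could be generated is shown below; the actual scripts are not reproduced here, and the use of the third-party `num2words` package is our choice for illustration. Hyphens in ordinals are replaced with spaces to match forms such as “july␣twenty␣fourth”.

```python
# A minimal sketch of generating the 366 date probes; not the exact scripts
# used for the experiment. Requires the third-party `num2words` package.
import calendar

from num2words import num2words

queries = []
for month in range(1, 13):
    month_name = calendar.month_name[month].lower()      # e.g. "june"
    days_in_month = calendar.monthrange(2020, month)[1]  # 2020 is a leap year
    for day in range(1, days_in_month + 1):
        ordinal = num2words(day, to="ordinal").replace("-", " ")  # e.g. "fourth"
        date_query = f"{month_name} {ordinal}"                    # e.g. "june fourth"
        # Three probes per date, mirroring the name tests: missing the last
        # letter, the full phrase, and the full phrase followed by a space.
        queries.extend([date_query[:-1], date_query, date_query + " "])

assert len(queries) == 366 * 3
```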

China international version: june␣fourth, june␣fourth␣
United States English: june␣fourth, june␣fourth␣
Canada English: (none)

Table 16: In each locale tested, queries which had no suggestions.

China international version: june␣fourt, june␣fourth, june␣fourth␣
United States English: june␣fourt, june␣fourth, june␣fourth␣
Canada English: june␣fourt, july␣twenty␣fourth␣

Table 17: In each locale tested, queries which were “suggestionless.” Recall that we use the term suggestionless to refer to queries where none of the autosuggestions contain the original query, including if there are no autosuggestions.

We found that in both the China international version and the United States English locale, two of the three tested queries related to “june␣fourth” had zero autosuggestions (see Table 16). No other days had zero autosuggestions in these regions, and no days had zero autosuggestions in the Canada English locale. We also found that in both the China international version and the United States English locale, all three of the tested queries related to “june␣fourth” were suggestionless, i.e., none of their autosuggestions contained the original query, including if there were no autosuggestions (see Table 17). No other days were suggestionless in these regions. In Canada, two queries were suggestionless: “june␣fourt” and “july␣twenty␣fourth␣”.

While we are unaware of any Chinese political significance to the date July 24, which had a single suggestionless query in Canada, the remainder of these results pertain to June 4. Due to the extreme sensitivity surrounding the June 4 Tiananmen Square Massacre, previous research has found that references to this day are among the most commonly censored on the Chinese Internet. As with our investigation into Bing’s censorship of names, we are aware of no explanation, other than Chinese-motivated political censorship, for why Bing would disproportionately censor this day of all days in North America.

Censorship implementation

While our report thus far has concentrated on understanding what Bing censors, in this section we speculate on how Bing censors, specifically the filtering mechanism that Bing applies to autosuggestions. We believe that the basic censorship mechanism may work as follows:

  1. Given an inputted string, retrieve up to the top n (for some n > 8) autosuggestions for that string.
  2. Apply regular expression or other filters to these autosuggestions to remove autosuggestions which match certain patterns.
  3. Among the remaining autosuggestions, display the top 8.

Such a mechanism would explain how, for instance, when one is typing in “xi␣jinping”, at first, after typing in only “xi”, there are still eight results, although, contrary to expectation, none of them contain Xi Jinping’s name. However, as one continues to type “xi␣jinping”, reaching as far as “xi␣j”, all of the available autosuggestions now contain Xi Jinping’s name and are therefore removed by a filter targeting him, and so Bing displays no autosuggestions. While such a filtering mechanism is compatible with our findings, we are unaware of any test to conclusively determine whether this is the exact mechanism used. Moreover, this explanation does not attempt to address the complex ways in which autosuggestion censorship from some locales may be bleeding into other locales.
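The sketch below is one speculative way such a pipeline could be structured. Everything in it is an assumption: `fetch_candidates` stands in for whatever produces the raw top-n suggestions, and the single blocklist pattern is purely illustrative.

```python
# A speculative sketch of the hypothesized filter-then-truncate pipeline; it
# is not Bing's implementation. `fetch_candidates` is a hypothetical function
# returning the top n unfiltered autosuggestions for a query.
import re

BLOCK_PATTERNS = [re.compile(r"xi jinping")]  # illustrative pattern only
DISPLAY_LIMIT = 8


def displayed_autosuggestions(query: str, n: int = 20) -> list[str]:
    candidates = fetch_candidates(query, n)  # hypothetical: top n raw suggestions
    allowed = [
        s for s in candidates
        if not any(p.search(s) for p in BLOCK_PATTERNS)
    ]
    return allowed[:DISPLAY_LIMIT]  # show at most the top 8 survivors
```

Under such a pipeline, a short prefix like “xi” still leaves enough unrelated candidates to fill all eight slots, while a narrower prefix like “xi␣j” leaves only filtered candidates and hence an empty list, matching the behavior described above.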

Affected services beyond Bing

Due to how Bing is built into other Microsoft products and due to how other search engines source Bing data, even users who do not use Bing’s search website may still be affected by the reach of its autosuggestion censorship. In this section, we set out some of the other products and services that we found affected.

Windows Start Menu

We found Windows’ Start Menu search restricted by Bing’s censorship in both Windows 10 and Windows 11. This feature is accessed by opening the Start Menu and then typing, and Windows users use it not only to find search results from the Web but also, commonly, to search for locally installed apps or locally stored documents.

Figure 7: In the United States English locale, Windows Start Menu search shows autosuggestions for “xi” not related to Xi Jinping and not any autosuggestions for “xi␣j”

Windows Start Menu censorship varies depending on how the Windows region settings are configured. We found that the Bing autosuggestions displayed in the Start Menu appeared consistent with the region selected as the Windows “Country or region” in Windows’ “Region” settings. For instance, when Windows is configured with the United States region, we found all autosuggestion results for “xi␣j” censored (see Figure 7) but not when configured with the Canada region.

Of note, the Windows Start Menu appears to strip trailing spaces from queries, so we were unable to directly test “xi␣”. However, in our experience, Bing interprets underscores as equivalent to spaces for the purposes of autosuggestions, and we found that “xi_” (“xi” followed by an underscore) produced no autosuggestions, consistent with the Web interface.

Microsoft Edge

Microsoft Edge is a cross-platform browser available for Windows, MacOS, and Linux that comes installed in the latest versions of Windows. We found that, by default, Edge uses Bing for its built-in search functionality and that its autosuggestions are also censored.

Figure 8: In the Canada English locale, Microsoft Edge shows no autosuggestions containing “习近平” (Xi Jinping).

Edge appears to use one’s IP geolocation to select a default region, and we were unable to find any setting to change the region governing the search autosuggestions of Edge’s built-in Bing implementation.

DuckDuckGo

Although DuckDuckGo is billed as a privacy-protecting search engine, we found that its autosuggestions are nevertheless affected by Bing’s autosuggestion censorship. This finding is likely due to DuckDuckGo’s close partnership with Bing for providing data.

Figure 9: DuckDuckGo provides no autosuggestions for “xi␣”

Although we did not extensively analyze censorship on DuckDuckGo, we found, for instance, that DuckDuckGo provides no autosuggestions for “xi␣” in its default region setting of “All regions” when we browsed from Canada. When explicitly setting DuckDuckGo’s region to that of the United States or Canada, the autosuggestions appeared to track the autosuggestions that Bing’s Web interface directly provides in those regions, including any censorship.

Autosuggestion censorship impact

How do autosuggestions and their censorship impact search behavior? One way to approach this question would be to recruit a large sample of people, divide them into two groups, provide autosuggestions to one group but not the other, and measure how the two groups’ searches differed. Another approach would be to offer one group spurious autosuggestions while giving organic autosuggestions to another. While these approaches would work, they are unnecessary: when Bing disables or enables autosuggestions in a country, or offers autosuggestions that no one would organically search for, it is already performing such an experiment for us. Moreover, by using Bing’s historical search volume data, we can measure that experiment’s results. Thus, to better understand how autosuggestions and their censorship influence search behavior, we analyze Bing’s historical search volume data with respect to three cases: (1) the spurious autosuggestions for Hillary Clinton and Alex Jones, and the shutdown of autosuggestion functionality in China from mid-December 2021 to early January 2022 and its effect on (2) benign searches for food and (3) searches for the controversial Falun Gong movement.

The spurious autosuggestions for Hillary Clinton and Alex Jones allow us to examine how spuriously introduced autosuggestions influence search trends, whereas the autosuggestion shutdown in China allows us to examine how the absence of autosuggestions influences search trends. Unlike earlier in our report, where we took a rigorous statistical approach, in the remainder of this section we only introduce cases where autosuggestions seem to have influenced what users search for, as a more rigorous analysis of how autosuggestions shape search behavior is outside the scope of this work.

In our research for this section, we also discovered a misleading flaw in the way that Bing’s Keyword Research tool visualizes search volume. As a result, in this section we relied entirely on the raw data returned by Bing’s API. We detail this issue in the Appendix.

Hillary Clinton and Alex Jones

In this section, we look at how artificially introduced autosuggestions alter search behavior. Specifically, we examine search trend data surrounding the spurious autosuggestions for “Hillary␣Clinton” and “Alex␣Jones”. During our December 2021 measurements, Bing provided the following autosuggestions for “Hillary␣Clinton” in the United States English locale:

  • hillary␣clinton-age
  • hillary␣clinton’s␣age
  • hillary␣clinton’s␣boyfriend
  • hillary␣clinton’s␣meme
  • hillary␣clinton’s␣accomplishment
  • hillary␣clinton’s␣women’s␣rights␣speech
  • hillary␣clinton’s␣college
  • hillary␣clinton/age

As we discussed earlier in the report, Bing’s autosuggestions for Hillary Clinton are anomalous, as Clinton’s name is never suggested by itself nor followed by a space and in every suggestion her name is followed by punctuation. While many searches beginning with “hillary␣clinton’s” (“hillary␣clinton” followed by an apostrophe and “s”) such as “hillary␣clinton’s␣age” may be naturally typed, we find it unlikely that there are many queries typed in with her name followed by hyphens or slashes such as “hillary␣clinton-age” or “hillary␣clinton/age”. Nevertheless, we find these spurious autosuggestions suggested by Bing.

Idea: Hillary Clinton’s age
  • hillary␣clinton-age (hyphen): suggested; search volume 13,710
  • hillary␣clinton␣age (space): not suggested; search volume 10,212
  • hillary␣clinton’s␣age (apostrophe): not suggested; search volume 1,637
  • hillary␣clinton/age (slash): suggested; search volume 51

Table 18: Queries relating to Hillary Clinton’s age, whether they were autosuggested by Bing upon inputting “Hillary␣Clinton”, and their search volume between October 2021 and April 2022.

In Table 18, we can see that the spurious autosuggestions for Clinton’s name were competitive against the organic search queries. Notably, the query where her name was followed by a hyphen received more search volume than the queries where her name was followed by a space or an apostrophe, although the query where her name was followed by a slash received the least search volume. While this finding provides evidence that users do make use of autosuggestions when searching, it does not necessarily tell us to what extent autosuggestions influence users’ behavior at a high level. After all, it is possible that most of the users who clicked on “hillary␣clinton-age” intended to search for Hillary Clinton’s age anyway, just with more typical punctuation. Nevertheless, the search volume dominance of the hyphen-containing query suggests that, to whatever end, users do click on autosuggestions.

We found that Alex Jones’s autosuggestions in the United States English locale follow a similar pattern:

  • alex␣jones
  • alex␣jones+infowars
  • alex␣jones-infowars
  • alex␣jones-youtube
  • alex␣jones’s␣house
  • alex␣jones’s␣dad
  • alex␣jones’s
  • alex␣jones/911

Idea: Alex Jones & Infowars
  • alex␣jones+infowars (plus sign): suggested; search volume 41,165 (trend data cannot distinguish between spaces and plus signs, so this figure also covers alex␣jones␣infowars)
  • alex␣jones␣infowars (space): not suggested; search volume combined with the plus-sign query above
  • alex␣jones-infowars (hyphen): suggested; search volume 19,438

Idea: Alex Jones & YouTube
  • alex␣jones-youtube (hyphen): suggested; search volume 150
  • alex␣jones␣youtube (space): not suggested; search volume 20

Idea: Alex Jones & September 11
  • alex␣jones/911 (slash): suggested; search volume 0
  • alex␣jones␣911 (space): not suggested; search volume 0

Table 19: Queries for Alex Jones, whether they were autosuggested by Bing upon inputting “Alex␣Jones”, and their search volume between October 2021 and April 2022.

In Table 19, we can again see spurious autosuggestions motivating a large amount of search volume. Due to a limitation in Bing’s API in which spaces and plus signs are treated identically, we cannot distinguish between search volume for queries in which his name is followed by a space versus a plus sign. However, for queries related to Jones and his show “Infowars,” the query where his name is followed by a hyphen has almost half as much search volume as the space and plus sign queries combined. Regarding queries for Jones and his YouTube videos, the query where his name is followed by a hyphen has 7.5 times as much search volume as the one where it is followed by a space. Again, we cannot say whether these spurious hyphen-containing queries are pulling users away from other searches that they might have typed concerning Alex Jones or whether users already intending to find Jones’s YouTube videos are simply clicking on the oddly punctuated autosuggestion instead of typing out a more naturally punctuated one. Nevertheless, as with our findings concerning Clinton, these findings concerning Jones show that users do click on Bing’s autosuggestions.

Shutdown of autosuggestion functionality in China

Microsoft’s shutdown of autosuggestion functionality in China offers us the opportunity to test how autosuggestions affect search behavior by allowing us to compare search behavior in their presence versus their total absence. We begin by looking at the autosuggestions for three benign words: 洋葱 (onion), 西瓜 (watermelon), and 榴莲 (durian).

Figure 10: Search trends for autosuggestions for 洋葱 (onion) not affected (left) and affected (right) by the shutdown.
Figure 11: Search trends for autosuggestions for 西瓜 (watermelon) not affected (left) and affected (right) by the shutdown.
Figure 12: Search trends for autosuggestions for 榴莲 (durian) not affected (left) and affected (right) by the shutdown.

In general, we find that the autosuggestions most affected by the shutdown are longer (see Figures 10, 11, and 12). We hypothesize that these are less likely to be naturally searched and thus benefit more from being autosuggested.
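A rough way to sanity-check this hypothesis, sketched below with placeholder strings rather than our measured data, is to compare the average length of the suggestions whose search volume disappeared during the shutdown against those whose volume persisted.

```python
# A minimal sketch with placeholder strings, not our measured data: compare
# the mean character length of suggestions whose volume vanished during the
# shutdown with those whose volume persisted.
from statistics import mean

affected = ["placeholder long suggestion one", "placeholder long suggestion two"]
unaffected = ["short one", "short two"]

print("mean length, affected:  ", mean(len(s) for s in affected))
print("mean length, unaffected:", mean(len(s) for s in unaffected))
```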

Falun Gong

We now turn our attention to Bing’s autosuggestions in China for the controversial 法轮功 (Falun Gong) spiritual and political movement. We found that, in addition to “法轮功/flg”, where “flg” is an abbreviation of Falun Gong, China Bing provided the following autosuggestions for “法轮功” (Falun Gong):

  • “法轮功␣危害” (Falun Gong dangers)
  • “法轮功常见非法宣传品” (Falun Gong Common Illegal Propaganda Materials)
  • “法轮功什么时候被国家取缔” (When was Falun Gong banned by the state?)
  • “法轮功邪教组织什么时候被取缔” (When was the Falun Gong cult banned?)
Figure 13: Search trends for 法轮功 (Falun Gong) (left) and for its autosuggestions (right).
Figure 14: In blue, the trend data for 法轮功 (Falun Gong), and in red, the aggregate (sum) data for all of its suggestions.

While Bing reports erratic trend data for Falun Gong-related autosuggestions, we find that none of these autosuggestions for Falun Gong had any trend data during the four weeks of Bing’s autosuggestion shutdown in China (see Figures 13 and 14). This finding, in addition to the seeming artificiality and one-sidedness of these autosuggestions, suggests that these autosuggestions are influencing searches rather than being induced by them.

To compare Bing’s autosuggestions for “法轮功” to other regions, we find that Bing provides no autosuggestions in the United States English and Canada English locales, consistent with the politically motivated censorship in these locales that we measured earlier in our report. Chinese search engine Baidu also provides no autosuggestions for “法轮功”. However, Google in Canada provides the following autosuggestions:

  • 法轮功␣英文 (Falun Gong in English)
  • 法轮功␣自焚 (Falun Gong self-immolation)
  • 法轮功官网 (Falun Gong official website)
  • 法轮功是什么 (What is Falun Gong)
  • 法轮功真相 (The truth about Falun Gong)
  • 法轮功创始人 (Founder of Falun Gong)
  • 法轮功媒体 (Falun Gong media)
  • 法轮功␣资金来源 (Falun Gong sources of funding)
  • 法轮功现状 (Status of Falun Gong)
  • 法轮功电影 (Falun Gong Movies)

We must be careful in comparing the autosuggestions of Bing versus Google, as these companies use different algorithms and are used by different populations, and thus we would already expect them to have different autosuggestions for completely benign reasons. However, we do find that Google’s autosuggestions are shorter, consistent with something that might be organically typed by a user, and that they are well balanced, with many supportive of and many critical of the Falun Gong movement.

While it makes sense to ask whether Bing introduces autosuggestions at the behest of the Chinese government, the question may be of little practical significance. Whether Bing introduces autosuggestions or censors all but certain autosuggestions, the practical result would be the same: Bing is artificially influencing searches in keeping with China’s propaganda requirements.

Limitations

We found that in each region the names censored by Chinese political censorship varied over time. To measure a consistent snapshot of Bing’s censorship, we tested during a short period of time (one week in December 2021). However, many of the examples we illustrate may no longer be censored, and other examples which we found not to be censored may now be. Later in this report, we discuss what the inconsistency of the Chinese political censorship across different regions says about why such censorship is affecting regions outside of China.

In our statistical analysis, there may exist confounding variables that we failed to account for. For English letter names, we tested whether Bing was merely failing to provide autosuggestions for sensitive Chinese political names in the United States because such names were more likely to be written in pinyin, a type of name that Bing may have struggled to provide autosuggestions for. However, we found that pinyin names failed to account for Bing’s Chinese political censorship. Moreover, we are unaware of any confounding variables that might explain why Bing politically censors Chinese character names. Even if such a confounding variable existed that lent some innocuous explanation for Bing’s Chinese political censorship, it would not change our finding that Bing disproportionately fails to provide autosuggestions for the names of people who are politically sensitive in China, regardless of explanation.

Using our methodology, a small number of individuals’ names appeared censored by Bing under no motivation that we could identify. One reason could be that we simply failed to identify some straightforward motivation. Another is that the name was collaterally censored and that we failed to recognize that it contained letter sequences that spell profanity in English or some other language. Although we did discover names collaterally censored for containing letter sequences commonly considered profane in non-English languages, we had limited ability to exhaustively recognize such names. Finally, the name may have simply appeared censored due to being a false positive, somehow having no autosuggestions despite having both high Wikipedia traffic and large Bing search volume, yet not being targeted for censorship by Bing in any region. However, since such false positives are, by definition, not being targeted by Bing, we would expect such failures not to be disproportionately politically sensitive in China compared to names chosen uniformly at random, and thus such names would be unable to explain our statistical findings.

Discussion

Our research shows that Bing’s Chinese political censorship of autosuggestions is not restricted to mainland China but also occurs in at least two other regions, the United States and Canada, which are not subject to China’s laws and regulations pertaining to information control. To our knowledge, Bing has not provided any public explanation or guidelines on why it has decided to perform censorship in various regions or why it has censored autosuggestions of names of these individuals.

On May 10, 2022, we sent a letter to Microsoft with questions about Microsoft’s censorship practices on Bing, committing to publishing their response in full. Read the letter here and the email response here that Microsoft sent on May 17, 2022.

Search engines are a major interface between users and the Internet. They serve as a gateway to online information, which to a large extent influences user attention and knowledge. The impacts of search engine results on the visibility and identity of a person, an organization, and even a country have been documented by many. Bing’s censorship, therefore, not only affects how users perceive certain entities but also dictates whether users get to know the existence of these entities. What might explain Bing’s censorship decisions in and beyond mainland China? We evaluate multiple hypotheses below.

Does Bing’s development team follow Chinese principles and norms concerning political expression?

After Microsoft’s 2021 blocking of images relating to the 1989 Tiananmen Square Movement, journalists speculated whether much of Bing’s development team being based in China contributed to that mistake. There are previous instances of China-headquartered companies, and of companies with development teams in China, implementing censorship targeting Chinese political sensitivity on global-facing products. ByteDance’s TikTok, for example, was found to instruct moderators to censor videos referencing Tiananmen Square, Falun Gong, and other topics considered politically controversial in China despite not offering services to China-based users. In 2020, a TikTok executive admitted that the app had censored content critical of the Chinese government but insisted that it had terminated the relevant content moderation policies favoring the views of the Chinese government. Since much of Bing’s development team is based in China, Bing’s developers, who are likely trained to follow Chinese laws and regulations in their everyday practices, may have similarly believed it appropriate to censor such content globally or may have been less likely to recognize if such content were being restricted accidentally.

Is Microsoft censoring outside of China to appease the Chinese government?

We might suspect that Microsoft performs Chinese political censorship globally in order to appease the Chinese government, perhaps as part of the conditions under which Microsoft is allowed to continue operating Bing and other Microsoft services inside of China. However, accepting this hypothesis requires caution for the following reasons. First, there is no explicit evidence suggesting Bing’s censorship behavior is the result of concessions made to the Chinese government. While much is known about the concessions that fellow North American tech giant Apple made to be allowed to operate in China, comparatively little is known about those made by Microsoft. Second, in a previous instance in which Bing was found to censor image results for the query “tank man” in North America, Microsoft attributed the censorship to “accidental human error” and quickly ceased performing the censorship. Finally, the inconsistency over time of which sensitive Chinese political names are censored in North America may suggest that this censorship is unintentional and an emergent property of some complex system rather than solely the deterministic application of a set of rules.

Nevertheless, while Microsoft may not be performing Chinese political censorship globally to appease the Chinese government, they are certainly performing Chinese political censorship in China to appease the Chinese government. Bing’s global Chinese political censorship would seem inextricably linked to Microsoft’s censorship operations in China. While it is unclear why Microsoft’s Chinese political censorship is leaking outside of China, it seems that it could not have leaked out had Microsoft never engaged in such censorship operations in the first place.

Is Bing’s censorship of users globally an unintentional result of their censorship of Chinese users?

We might suspect that Bing’s censorship of users in China impedes these users from making certain search queries and that, since historical search behavior is used to inform autosuggestions, the search behavior of censored Chinese users is being used to inform the autosuggestions shown to users globally. Search engines’ autosuggestions are based largely on ranking signals which include, among other indicators, how many users have submitted a certain keyword for search and how many times a suggestion has been selected in the past. In our earlier cross-region comparison, we discovered that the autosuggestions in mainland China and the United States shared a large amount of overlap, suggesting that a shared data source informs autosuggestions in both regions. Thus, since Bing’s censorship of autosuggestions in China affects the search behavior of Chinese users, and since autosuggestions in other regions appear to be based, in part, on the search behavior of Chinese users, it is possible that this is how Bing’s autosuggestion censorship is leaking globally.

However, even if this mechanism played a role in explaining the behavior that we observed, it would not seem to be a complete explanation. While Bing’s autosuggestion censorship may effectively hide sensitive political names such as those of dissidents from Chinese users, it also has a second purpose which is to hide sensitive suggestions concerning Communist Party leaders. While autosuggestion censorship might prevent a Chinese user searching for “Xi Jinping” from learning about negative narratives surrounding the paramount leader such as the Winnie the Pooh mockery, it would not generally prevent a user from searching for Xi Jinping in general. Thus, this hypothesis seems unable to explain the global autosuggestion censorship of any reference to Communist Party leaders such as Xi Jinping.

Is Chinese users’ self-censorship affecting autosuggestions globally?

We might also wonder if Bing’s global Chinese political censorship is resulting from users in China self-censoring their searches and if this self-censorship search behavior is being used to inform the autosuggestions of users outside of China. However, this hypothesis appears to be, at best, an incomplete explanation for similar reasons as the previous hypothesis. While some names are inherently sensitive, such as Liu Xiaobo, others are not inherently sensitive, such as Xi Jinping or the names of other Communist Party leaders. Thus, it does not explain why we do not see autosuggestions for Xi Jinping in the United States when Chinese users would have no reason to self-censor their searches for Xi Jinping.

Can Microsoft even permanently address this problem?

The idea that Microsoft or any other company can operate an Internet platform which facilitates free speech for one demographic of users while intrusively applying political censorship to another demographic of its users may be fundamentally untenable. Our prior research has previously documented failures in such attempts.

In previous work, we discovered that not only did the Chinese version of Microsoft’s Skype perform extensive keyword-based political censorship and surveillance but that such surveillance logs containing users’ private messages were inadvertently exposed to the public. Skype’s censorship and surveillance also applied globally to users outside of China when they were communicating with Chinese users, with or without their knowledge, and it finally ended when Microsoft abandoned bundling censorship and surveillance in its Chinese version of Skype.

Tencent’s WeChat also operates a platform in which it politically censors users in China while attempting to facilitate uncensored speech for users outside of China. As with Skype, WeChat users outside of China can have their communications unknowingly captured by WeChat’s censorship and surveillance system when they communicate with China-based users. However, recent work also discovered that, even when two WeChat users who are both outside of China communicate with each other, their communications are silently surveilled for Chinese political sensitivity and used to train and build up the political censorship apparatus that WeChat applies to users in China.

Recent work documents Apple’s failure to contain their Chinese political censorship of users’ product engraving text to only China. In our work we discovered that Apple applied mainland Chinese political censorship to users in Taiwan for reasons most likely stemming from negligence. Since our first report, Apple has ceased political censorship of Taiwan users’ engravings. However, we found that Apple continues to politically censor users in Hong Kong. As other North American tech companies do not perform similar political censorship in Hong Kong, we speculated over possible motivations Apple may have for performing it, including appeasement of the Chinese government.

We may hope that resolving whichever series of decisions, errors, bugs, or glitches gave rise to the issues we discovered in this report might fix them. However, this is not the first time that Microsoft has allowed Chinese political censorship on Bing to be applied to users globally, and there is little assurance that it will be the last. In light of our past research, the findings in this report again demonstrate that an Internet platform cannot facilitate free speech for one demographic of its users while applying extensive political censorship against another. One method for Microsoft to assure its users outside of China that they are free from its Chinese political censorship would be to shutter its Chinese political censorship operations in their entirety. However, Microsoft may find this solution undesirable because doing so would mean violating content control laws and regulations in China and upsetting the Chinese government, which would fundamentally affect Microsoft’s operations in the Chinese market. A second approach would be for Microsoft to bifurcate its search operations into two completely separate operations, similar to what Microsoft has recently done with its LinkedIn platform. While bifurcating a globalized product further splinters the Internet and risks cutting off China-based users’ access to external information, absent such changes, it is unclear whether Microsoft can feasibly operate an Internet platform that provides free speech for some of its users while violating that right for the rest.

Acknowledgments

We would like to thank Jedidiah Crandall, Miles Kenyon, and Gabby Lim for helpful comments and peer review. Funding for this research was provided by foundations listed on the Citizen Lab’s website. Research for this project was supervised by Masashi Crete-Nishihata and Professor Ron Deibert.

Availability

The list of names that we discovered to be censored in each locale that we tested is available here.

Appendix: Bug in Keyword Research tool

In this section, we describe a subtle but potentially misleading flaw in the way that Bing’s Keyword Research tool visualizes data.

Figure 15: Top, Bing’s Keyword Research Tool results for “李文亮的英雄事迹” (Li Wenliang’s heroic deeds) which fails to plot weeks with zero search volume; bottom left, our plot of the raw data returned by Bing’s API attempting to reproduce the Research Tool’s bug by excluding zero-volume weeks; bottom right, our plot of the raw data returned by Bing’s API correctly including zero-volume weeks.

We found that the plot generated by the tool incorrectly collapsed weeks without any search volume, making the plot look continuous even when a large gap existed in search volume. The only hint that the tool gives that the plot is missing data is that, despite a certain range (e.g., six months) being requested, a shorter range is visualized (e.g., in Figure 15, mid-October to mid-January, as opposed to the six-month range of mid-October to mid-April that was requested). The plot thus erroneously appears to show search volume in weeks which actually had none.
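As an illustration of the fix (using placeholder data, not actual Bing API output), reindexing the weekly series over the full requested range before plotting keeps zero-volume weeks on the x-axis:

```python
# A minimal sketch with placeholder data: reindex weekly search-volume data
# over the full requested range so weeks with no volume plot as zero instead
# of being collapsed.
import pandas as pd

raw = pd.DataFrame(
    {
        "week": pd.to_datetime(["2021-10-18", "2021-10-25", "2022-01-10"]),
        "volume": [120, 95, 40],
    }
).set_index("week")

# Every Monday in the requested six-month window; missing weeks become zero.
full_index = pd.date_range("2021-10-18", "2022-04-18", freq="W-MON")
fixed = raw.reindex(full_index, fill_value=0)

fixed.plot(y="volume")
```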

Appendix: Comments on Microsoft’s response

We would like to thank Microsoft for considering and replying to our letter. With respect to some of our findings from our December 2021 experiment being no longer reproducible in May 2022, we recognize in our report that the censorship of autosuggestions which we characterize through our research fluctuates over time. However, we have observed that the direction of fluctuation is not always in the direction of reducing censorship.

With respect to Microsoft’s discovery and resolution of a misconfiguration preventing valid autosuggestions from appearing, we are happy that our research led to the discovery and resolution of such a misconfiguration. However, aside from general fluctuations, we are unaware of any change in Bing’s tendency to censor autosuggestions which are politically sensitive in China. For example, in the “Other censorship findings” section, we perform an experiment testing all 366 possible days of the year, finding only the most sensitive day in China, “june␣fourth”, the day of the Tiananmen Square Massacre, was censored in the United States English locale. This experiment was performed on May 18, 2022, after the receipt of Microsoft’s response.
