Quantified Cyber Censorship: Undergraduate Research of the CPC's Great Firewall

As you may know if you follow me on Twitter, I recently "doxed" myself on my Twitter and, by association, Steemit account. I will keep using my pseudonym for old times' sake, but I will now also go by my "IRL" name, Mitch. I'm an undergraduate student, graduating in May with a Bachelor's in Chinese and a Bachelor's in Computer Science.

Recently, the CS/IT department at my university began an undergraduate research program. I had been doing unfunded research with a professor beforehand, but the new program offered pay and publishing incentives. Since beginning the program, I have delved further into my research, far enough to have made progress, and far enough to put some details about it here.

Conquering the Wall... Or At Least Seeing It.

My goal going into my first research program was to combine both of my specialties, Computer Science and Mandarin Chinese, in a project that makes a difference. I knew from firsthand and secondhand experience how severely the Chinese intranet is censored, so I quickly decided that a censorship-based project was a solid plan to aim for.

What is one of the most difficult parts of tackling censorship? It's generally non-quantifiable. It's someone complaining about a post being taken down, talking about it to friends, occasionally writing a strongly worded Facebook memo to air out the grievance. These are all common signs of censorship... in a society where censorship is not severe and where a wrongful takedown is usually an algorithmic error rather than an intentional act.

In China, the censorship is very intentional. Very little is said directly about censorship on the web; Chinese netizens prefer to use pseudonymous slang and to speak around the issue, if they discuss it at all. There are reports of gulags for online troublemakers, where people are shuttled away in the dark of night for something they said in a supposedly private chat room. Screenshots often circulate of popular social media posts that were censored just because they were popular, and dissidents like Ai Weiwei speak on censorship constantly, often very much at their own expense.

Reports and the occasional documentary, though, are hardly an adequate quantification of such a complex problem. There are no numbers to show, no correlative graph, and no inside knowledge on what is being censored, when, how heavily, and why. The trackers aren't being tracked.

That is where my research hits home. I am seeking to quantify censorship by monitoring different communities on the popular Chinese social media site Zhihu. The core of the project is relatively simple: collect potentially dissident material from political discussion forums and other communities likely to generate a higher volume of "dangerous" material; check at set intervals whether the posts have been changed, removed, or otherwise erased; and, if they have, use machine learning and big data approaches to find correlations among the posts that have been confirmed to be censored.
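To make the "check at intervals" step concrete, here is a minimal sketch of how the re-checking could work. It is only an illustration, not the project's actual code: the names `TrackedPost`, `track`, and `recheck` are hypothetical, and `fetch` stands in for whatever function would retrieve a post's current text (returning `None` if the post is gone). Note that a missing post is only a candidate for censorship, since the author may have deleted it.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class TrackedPost:
    post_id: str
    text: str              # text as first collected
    digest: str            # SHA-256 of that text
    status: str = "live"   # "live", "changed", or "removed"


def digest_of(text: str) -> str:
    """Fingerprint a post so later copies can be compared cheaply."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def track(post_id: str, text: str) -> TrackedPost:
    """Start tracking a freshly collected post."""
    return TrackedPost(post_id, text, digest_of(text))


def recheck(posts: Dict[str, TrackedPost],
            fetch: Callable[[str], Optional[str]]) -> Dict[str, str]:
    """Re-fetch tracked posts; return {post_id: new_status} for status changes.

    `fetch` is an assumed helper: current text of a post, or None if missing.
    """
    changes: Dict[str, str] = {}
    for post in posts.values():
        if post.status == "removed":
            continue  # nothing left to re-check
        current = fetch(post.post_id)
        if current is None:
            new_status = "removed"
        elif digest_of(current) != post.digest:
            new_status = "changed"
        else:
            new_status = "live"
        if new_status != post.status:
            post.status = new_status
            changes[post.post_id] = new_status
    return changes


# Toy usage: a dict plays the role of the site, so nothing touches the network.
tracked = {p.post_id: p for p in [track("a", "hello"), track("b", "some news")]}
fake_site = {"a": "hello"}  # post "b" has disappeared
print(recheck(tracked, fake_site.get))
```

In a real run, `recheck` would be called on a schedule, and each status change would be stored with a timestamp so the censored posts can be analyzed later.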

Through this mechanism, we begin to at least get an idea of the volume of posts that are being censored. The big data and machine learning approach adds several technical and linguistic challenges, but the goal is to correlate what is being censored and when, so that users and analysts can try to figure out why. If a certain subject, say discussion about an earthquake in Fujian province, is being heavily censored over one week, it's clear there is something the CPC does not want Chinese people to see or discuss. Correlated terms like 'earthquake', 'fujian', 'death toll', etc. appearing in censored posts during a specific week would indicate a spike in censorship on the subject.
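As a rough illustration of the weekly spike idea, the sketch below swaps the machine learning step for plain term counting: it tallies terms in censored posts per ISO week and flags terms that are far more frequent in one week than in the others. Everything here is an assumption for the sake of example: the function names, the `ratio` and `min_count` thresholds, and the premise that posts arrive already split into terms (segmenting Chinese text is one of the linguistic challenges and is not handled here). The data is invented.

```python
from collections import Counter, defaultdict
from datetime import date
from typing import Dict, Iterable, List, Tuple

Week = Tuple[int, int]  # (ISO year, ISO week number)


def weekly_term_counts(
        censored: Iterable[Tuple[date, List[str]]]) -> Dict[Week, Counter]:
    """Count, per ISO week, how many censored posts contain each term."""
    weeks: Dict[Week, Counter] = defaultdict(Counter)
    for day, terms in censored:
        iso = day.isocalendar()
        weeks[(iso[0], iso[1])].update(set(terms))  # once per post
    return dict(weeks)


def spiking_terms(weeks: Dict[Week, Counter], target: Week,
                  ratio: float = 3.0, min_count: int = 2) -> List[str]:
    """Terms seen at least `min_count` times in `target` week and at least
    `ratio` times their average weekly count in all other weeks."""
    others = [counts for week, counts in weeks.items() if week != target]
    spikes = []
    for term, count in weeks.get(target, Counter()).items():
        baseline = (sum(c[term] for c in others) / len(others)
                    if others else 0.0)
        if count >= min_count and count >= ratio * baseline:
            spikes.append(term)
    return sorted(spikes)


# Invented toy data: (date the post was found censored, its terms).
censored_posts = [
    (date(2024, 1, 2), ["weather", "food"]),
    (date(2024, 1, 3), ["food", "earthquake"]),
    (date(2024, 1, 9), ["earthquake", "fujian", "death toll"]),
    (date(2024, 1, 10), ["earthquake", "fujian"]),
    (date(2024, 1, 11), ["earthquake", "food"]),
]
counts = weekly_term_counts(censored_posts)
print(spiking_terms(counts, (2024, 2)))
```

A fixed ratio against the average of other weeks is about the crudest baseline possible. It is here only to show the shape of the output, a short list of terms per week, that a more serious statistical or machine learning model would also aim to produce.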

This is a huge undertaking. I have the rest of this semester and the next to tackle this project, and look forward to making more progress. I will be keeping the blog updated with progress and challenges, and look forward to the insight that we can all gain into the dark depths of the Great Firewall of China.

---------------------
Like the post? I run this threat intelligence blog on Steemit and offer the content free of charge. If you're a Steemit user, you know that upvoting, which you do for free, magically puts a couple cents in my pocket. Maybe I'll buy a pack of gum with last week's earnings, but it all depends on your help. Not a Steemit user? My biggest metric of success is my viewership. If I don't make a cent but my content reaches a wide audience, that means my product is valuable and my efforts are worthwhile. Therefore, give me a share on your social media of choice, follow me on Steemit for more threat intel posts, and follow me on Twitter to see stupid memes and get updates when I post.