Which Chinese companies or websites pay the most attention to data mining techniques?

果壳网 (Guokr) is quite active in the data mining field in China, maintaining C/Python/Node libraries for advanced Chinese NLP, topic modeling, etc.

Taobao has recently tried to diversify its data team by hiring new talent from different backgrounds - not only CS, but also storytelling, graphic design, etc. They may be up to something interesting there, but it is too early to say.

As for companies like Douban or Weibo, staying up to date with the rules of the Chinese web keeps their teams busy with non-productive tasks (such as deleting posts and that sort of thing), which doesn't allow them to fully investigate the opportunities of data mining. As for independent companies and startups, many still have teams doing data collection and reporting mostly by hand.

The fact is that data mining technology for the Chinese language is still not very advanced, and big universities in China don't dedicate funds to this kind of research. So there is clearly a lack of qualified engineers and proper tools/libraries. This is changing, and people like 数据堂 (Datatang), for instance, are doing a great job. Many of the young students I see nowadays also show great interest in advancing data mining. The Beijing Ministry of Housing is even talking about an Open Data platform (there are already local ones, but nothing at the country level). There is also a growing pool of journalists eager to learn about CS and databases.

To sum it up, there is surely great potential for data mining technology in China today, but the stakes are very different: the biggest player doing real badass data engineering is the Party, led by Fang Binxing and his team with the Great Firewall/Golden Shield/censorship project. Add the fact that anything in China (including data) can be fake or go missing, and the whole environment becomes tricky to deal with.

This text was originally published on Quora.

A question? A comment?

Please send it to me by email at bonjour@clementrenaud.com or on Twitter.