@qq383762126 opened this Issue on April 17th 2018

Can we solve the problem of GBK-encoded search terms from Chinese search engines?

@Findus23 commented on April 17th 2018 Member

Hi, can you further describe what issue you are referring to and how to reproduce it?

@qq383762126 commented on April 17th 2018

Some Chinese search engines, like the one below, encode search terms in GBK rather than UTF-8. Matomo does not recognize this and treats every search term as UTF-8, so the search terms come out garbled.

http://www.sogou.com/web?query=%E4%B8%93%E5%88%A9%E6%9F%A5%E8%AF%A2&ie=utf8&_ast=1523950080&_asf=null&w=01029901&cid=&s_from=result_up&sut=6728&sst0=1523950055276&lkt=1%2C1523950053745%2C1523950053745&sugsuv=001D5390DED1583F5A8E5819DE6D1320&sugtime=1523950055276
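
To illustrate the kind of mismatch described here, a minimal Python sketch (standard library only, not Matomo's actual code). It assumes a hypothetical GBK-encoded query string built from the term 专利查询; all variable names are illustrative.

```python
from urllib.parse import quote, unquote

term = "专利查询"

# Hypothetical: the percent-encoded form a GBK-only search page would emit.
gbk_query = quote(term, encoding="gbk")

# Decoding with the charset that was actually used recovers the term.
decoded_ok = unquote(gbk_query, encoding="gbk")

# Decoding the same bytes as UTF-8 cannot recover it. Invalid byte
# sequences are replaced instead of raising, so the result is garbled.
decoded_wrong = unquote(gbk_query, encoding="utf-8", errors="replace")

print(decoded_ok == term)
print(decoded_wrong == term)
```

The point of the sketch is only that the decoder has to know which charset the search engine used; guessing UTF-8 for GBK bytes silently produces an unreadable term.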

@sgiehl commented on April 21st 2018 Member

@qq383762126 The charset for Sogou is defined as gb2312. The search term detected for the URL you posted should be 专利查询.
Which version of Matomo are you using?
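
As a quick check of the URL posted above, the following Python sketch percent-decodes its `query` parameter as UTF-8. It uses only the standard library and is not how Matomo itself parses referrers; the URL is shortened to the two relevant parameters.

```python
from urllib.parse import urlsplit, parse_qs

url = "http://www.sogou.com/web?query=%E4%B8%93%E5%88%A9%E6%9F%A5%E8%AF%A2&ie=utf8"

# parse_qs percent-decodes values as UTF-8 by default.
params = parse_qs(urlsplit(url).query)

term = params["query"][0]   # the search term
declared = params["ie"][0]  # the input encoding declared in the URL

print(term, declared)
```

In this particular URL the `ie` parameter says `utf8`, and the bytes decode cleanly as UTF-8 to 专利查询.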

@fengkaijia commented on April 28th 2018 Contributor

I just checked my Matomo, and I too have around 5% of records from Sogou that are unreadable, for example 杩琛¢】 or 娉缃nag (which have no meaning). That is still less than 5% of my traffic from Sogou, though. My guess is that Sogou serves a non-UTF-8 version of its interface to users on older systems, such as IE6 on Windows XP. Since my blog is about Linux, my readers usually don't use Windows XP, so I didn't notice this 5% of mojibake until now.

@Findus23 commented on May 24th 2020 Member

It sounds like those browsers are already sending invalid UTF-8 to Matomo, so there is little that can be fixed here. And as long as Matomo gets valid UTF-8 data, it should now be possible to store any Unicode character thanks to https://github.com/matomo-org/matomo/issues/9785.

This Issue was closed on May 24th 2020