Chinese GBK code disorder code problem #12732

Closed
qq383762126 opened this issue Apr 17, 2018 · 5 comments
Labels
answered For when a question was asked and we referred to forum or answered it.

Comments

@qq383762126

Can Matomo handle search terms from Chinese search engines that use GBK encoding?

@Findus23
Member

Hi, can you further describe what issue you are referring to and how to reproduce it?

@qq383762126
Author

With Chinese search engines like the one below, the query is encoded in GBK rather than UTF-8. Matomo does not recognize this and decodes everything as UTF-8, so the search term comes out garbled.

http://www.sogou.com/web?query=%E4%B8%93%E5%88%A9%E6%9F%A5%E8%AF%A2&ie=utf8&_ast=1523950080&_asf=null&w=01029901&cid=&s_from=result_up&sut=6728&sst0=1523950055276&lkt=1%2C1523950053745%2C1523950053745&sugsuv=001D5390DED1583F5A8E5819DE6D1320&sugtime=1523950055276
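
To illustrate the mismatch described above, here is a minimal Python sketch. It is not Matomo code; the variable names are assumptions chosen for illustration, and the sample term is the one carried by the URL above. It percent-encodes the term as GBK, then decodes the result once as UTF-8 and once as GBK.

```python
from urllib.parse import quote, unquote

term = "专利查询"  # sample term, assumed for illustration

# What a GBK-based search page would put in its URL.
gbk_query = quote(term, encoding="gbk")

# Decoding those bytes as UTF-8 cannot recover the term: invalid
# sequences are replaced with U+FFFD, so the keyword comes out garbled.
decoded_as_utf8 = unquote(gbk_query, encoding="utf-8", errors="replace")

# Decoding with the charset the page actually used restores the term.
decoded_as_gbk = unquote(gbk_query, encoding="gbk")

print(gbk_query)
print(repr(decoded_as_utf8))
print(decoded_as_gbk)
```

The point of the sketch is only that the decoder has to know which charset the referrer used; the bytes alone do not say.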

@sgiehl
Member

sgiehl commented Apr 21, 2018

@qq383762126 The charset for Sogou is defined as gb2312. The search term detected for the URL you posted should be 专利查询.
Which version of Matomo are you using?
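
As a cross-check of the expected term, the following Python sketch extracts the `query` parameter from the posted URL (shortened here to its `query` and `ie` parameters). The function name `extract_search_term`, its parameters, and the idea of honoring the `ie` hint before falling back to gb2312 are assumptions for illustration, not Matomo's actual search engine detection.

```python
import codecs
from urllib.parse import parse_qs, urlsplit

# The URL from this thread, reduced to the two parameters that matter here.
SOGOU_URL = (
    "http://www.sogou.com/web"
    "?query=%E4%B8%93%E5%88%A9%E6%9F%A5%E8%AF%A2&ie=utf8"
)


def extract_search_term(url, param="query", charset_param="ie",
                        default_charset="gb2312"):
    """Hypothetical helper: decode `param` using the URL's charset hint."""
    raw_query = urlsplit(url).query

    # First pass: latin-1 maps every byte to a character, so this cannot
    # fail, and the charset hint itself is plain ASCII.
    hints = parse_qs(raw_query, encoding="latin-1")
    charset = hints.get(charset_param, [default_charset])[0]

    # Fall back to the default if the hint names an unknown codec.
    try:
        codecs.lookup(charset)
    except LookupError:
        charset = default_charset

    # Second pass: decode the percent-encoded bytes with the chosen charset.
    values = parse_qs(raw_query, encoding=charset, errors="replace")
    return values.get(param, [None])[0]


print(extract_search_term(SOGOU_URL))
```

With `ie=utf8` present, the sketch decodes the parameter as UTF-8; without the hint it would assume gb2312, the charset mentioned above for Sogou.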

@fengkaijia
Contributor

I just checked my Matomo instance, and I too have unreadable records from Sogou, for example 杩琛¢】 or 娉缃nag (which have no meaning). They account for less than 5% of the traffic from Sogou, though. My guess is that Sogou serves a non-UTF-8 version of its interface to users on older systems, such as IE6 on Windows XP. Since my blog is about Linux, my readers rarely use Windows XP, so I had not noticed this mojibake until now.
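
Strings of this shape can result when UTF-8 bytes are decoded as GBK. The Python sketch below reproduces that effect with an assumed sample term; it illustrates the mechanism only and is not a diagnosis of these particular records.

```python
term = "专利查询"  # sample term, assumed for illustration

utf8_bytes = term.encode("utf-8")

# Wrong charset: interpret the UTF-8 bytes as GBK. Byte pairs that
# happen to form GBK characters turn into unrelated hanzi, and bytes
# that do not are replaced with U+FFFD.
garbled = utf8_bytes.decode("gbk", errors="replace")

# Correct charset: the original term comes back unchanged.
restored = utf8_bytes.decode("utf-8")

print(repr(garbled))
print(restored)
```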

@Findus23
Member

It sounds like those browsers are already sending invalid UTF-8 to Matomo, so there is little that can be fixed here. As long as Matomo receives valid UTF-8 data, it should now be possible to store any Unicode character thanks to #9785.
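
For anyone who wants to check incoming terms on their own side, here is a minimal Python sketch of the "valid UTF-8 or not" distinction. The function name `decode_search_term` and the gb18030 fallback are assumptions for illustration and do not describe what Matomo does. Note that it is only a heuristic: some GBK byte sequences are also valid UTF-8, so a strict UTF-8 check cannot catch every case.

```python
from urllib.parse import quote, unquote_to_bytes


def decode_search_term(percent_encoded, fallback_charset="gb18030"):
    """Hypothetical helper: prefer strict UTF-8, else try a fallback charset."""
    raw = unquote_to_bytes(percent_encoded)
    try:
        # Strict decoding raises on any byte sequence that is not UTF-8.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # gb18030 is a superset of GBK and gb2312; undecodable bytes
        # are replaced rather than raising.
        return raw.decode(fallback_charset, errors="replace")


# The UTF-8 query from the URL posted in this thread.
utf8_term = decode_search_term("%E4%B8%93%E5%88%A9%E6%9F%A5%E8%AF%A2")

# The same term as a GBK-based page would send it.
gbk_term = decode_search_term(quote("专利查询", encoding="gbk"))

print(utf8_term)
print(gbk_term)
```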

@Findus23 added the answered label on May 24, 2020