@robocoder opened this Issue on September 11th 2010 Contributor

The current data file:

  • requires exact match on domains; as a result, we have numerous 'holes':
    • country code tlds (e.g., www.example.ca, www.example.us)
    • country subdomains (e.g., ca.example.com, us.example.com)
    • wildcard subdomains (e.g., *.example.com)

Proposal:

  • use Public Suffix List and/or regular expressions
  • backfill data from the master record at runtime to avoid unnecessary duplication
    • we already do this to some extent; in which case, we just need to prune some entries (e.g., Baidu)
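
As a rough illustration of the backfill idea, here is a minimal Python sketch (not the project's PHP). It assumes each record is a tuple of engine name, query parameter, and URL template, and that the first record seen for an engine name is its master record. The hostnames, field layout, and the `backfill` helper are hypothetical.

```python
# Hypothetical, trimmed-down data file. The first entry for each engine name
# acts as the "master record"; later entries may omit trailing fields.
SEARCH_ENGINES = {
    "www.google.com": ("Google", "q", "search?q={k}"),
    "www.google.ca": ("Google",),            # everything backfilled
    "www.baidu.com": ("Baidu", "wd", "s?wd={k}"),
    "zhidao.baidu.com": ("Baidu", "word"),   # overrides only the parameter
}


def backfill(engines):
    """Pad short records with the trailing fields of their master record."""
    masters = {}
    filled = {}
    for host, record in engines.items():
        name = record[0]
        # The first record seen for a name becomes that name's master.
        master = masters.setdefault(name, record)
        filled[host] = tuple(record) + tuple(master[len(record):])
    return filled


ENGINES = backfill(SEARCH_ENGINES)
```

With this layout, pruning an entry means deleting the fields it shares with its master and keeping only the ones that differ.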

Affects:

  • Piwik_Common::extractSearchEngineInformationFromUrl()
  • plugins/Referers/functions.php

ToDo:

  • strtolower in comment:1
  • iconv in comment:2

@robocoder commented on September 11th 2010 Contributor

I see we call strtolower on the keywords. This may not be safe to do with the 'C' locale unless it happens to be UTF-8 aware.
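
To make the concern concrete, here is a small Python sketch (an analogy, not PHP's `strtolower` itself). It compares three ways of lowercasing a UTF-8 keyword: ASCII-only, byte-wise under an assumed single-byte (Latin-1) locale, and Unicode-aware. The helper names are hypothetical.

```python
def ascii_lower(raw: bytes) -> bytes:
    """Lowercase only A-Z; every other byte is left untouched."""
    return bytes(b + 32 if 0x41 <= b <= 0x5A else b for b in raw)


def latin1_lower(raw: bytes) -> bytes:
    """Lowercase byte by byte as if the data were Latin-1 (assumed locale)."""
    return raw.decode("latin-1").lower().encode("latin-1")


def unicode_lower(raw: bytes, charset: str = "utf-8") -> bytes:
    """Decode first, lowercase as text, then re-encode."""
    return raw.decode(charset).lower().encode(charset)


keyword = "ÉCOLE Paris".encode("utf-8")

safe_but_partial = ascii_lower(keyword)   # "École paris": É is not lowered
corrupted = latin1_lower(keyword)         # lead byte 0xC3 becomes 0xE3
correct = unicode_lower(keyword)          # "école paris"
```

The ASCII-only version is harmless but incomplete, while the single-byte version rewrites the first byte of the two-byte sequence for É and leaves bytes that are no longer valid UTF-8.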

@robocoder commented on September 12th 2010 Contributor

Task: review the iconv() code in extractSearchEngineInformationFromUrl(). The keywords from naver.com are showing up empty. The encoding in SearchEngines.php is specified as x-windows-949 (which I gather is a superset of the search page's charset, euc-kr).
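
For reference while reviewing, the following Python sketch shows how the declared charset changes what comes out of a percent-encoded keyword. It is an illustration under assumptions: the referrer URL and the `query` parameter are made up for the example, Python's `cp949` codec stands in for x-windows-949, and `extract_keyword` is a hypothetical helper.

```python
from urllib.parse import parse_qs, quote, urlsplit


def extract_keyword(referrer: str, param: str, charset: str) -> str:
    """Return the decoded value of `param`, or "" if it is absent."""
    query = urlsplit(referrer).query
    values = parse_qs(query, encoding=charset, errors="replace")
    return values.get(param, [""])[0]


# Build an example referrer whose keyword bytes are EUC-KR encoded.
referrer = (
    "http://search.naver.com/search.naver?query="
    + quote("검색", encoding="euc_kr")
)

as_cp949 = extract_keyword(referrer, "query", "cp949")
as_utf8 = extract_keyword(referrer, "query", "utf-8")
```

Decoding with the superset charset recovers the keyword, whereas treating the same bytes as UTF-8 yields only replacement characters.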

@robocoder commented on September 12th 2010 Contributor

(In [3136]) refs #1694 - detect powered by google custom search

@robocoder commented on September 13th 2010 Contributor

(In [3141]) refs #1694 - prune arrays (these will be backfilled from the master record)

Separate "Powered by Google" (i.e., uses Google exclusively for search) from "Enhanced by Google" (uses Google in addition to other search engines); the latter are treated as separately branded (meta) search engines.

@robocoder commented on September 13th 2010 Contributor

(In [3142]) refs #1694 - add unit tests for missing and obsolete search engine icons; adapted from halfdan's ProcessFavIcons.php in #1350

@robocoder commented on September 13th 2010 Contributor

(In [3144]) refs #1694 - add Piwik_Common::getLossyUrl($url) to reduce referrer URLs to a more basic form/pattern. I'll prune the Google and Yahoo entries later.
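
The following Python sketch shows one way such a reduction could work. It is not the committed getLossyUrl() code: the rules, the short country-code list, and the `lossy_url` name are assumptions chosen for illustration. Country-specific parts of the host are replaced with a `{}` placeholder so one data-file entry can cover many variants.

```python
import re

# Assumed sample; a real list would cover all country codes.
COUNTRY_CODES = ("ca", "de", "fr", "uk", "us")
_CC = "|".join(COUNTRY_CODES)

_RULES = [
    # Drop a leading "www.", "www2.", or "search." label.
    (re.compile(r"^(w+[0-9]*|search)\."), ""),
    # Country-code TLD, optionally preceded by a second-level label.
    (re.compile(rf"(\.(com|org|net|co))?\.({_CC})(/|$)"), r".{}\4"),
    # Country-code subdomain.
    (re.compile(rf"(^|\.)({_CC})\."), r"\1{}."),
]


def lossy_url(url: str) -> str:
    """Reduce a scheme-less referrer URL to a coarser pattern."""
    for pattern, replacement in _RULES:
        url = pattern.sub(replacement, url)
    return url
```

Under these rules, `www.google.ca` and `www.google.co.uk/search` both reduce to a `google.{}` form, and `ca.search.yahoo.com` becomes `{}.search.yahoo.com`.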

@robocoder commented on September 13th 2010 Contributor

(In [3145]) refs #1694 - fix forestle.org and add unit test (i.e., {} can't appear in master record)

@robocoder commented on September 13th 2010 Contributor

(In [3146]) refs #1694 - update favicon names

@robocoder commented on September 13th 2010 Contributor

(In [3149]) refs #1694 - applied lossy {} tld to 123people, google, lycos, and yahoo

@sgiehl commented on September 13th 2010 Member

Replying to vipsoft:
Some Google entries are duplicates and can be removed (lines 381 and 382).

@robocoder commented on September 14th 2010 Contributor

(In [3150]) refs #1694

@robocoder commented on September 14th 2010 Contributor

(In [3151]) refs #1694 - lossy Bing images URL

@robocoder commented on September 15th 2010 Contributor

Note: when a user views a cached page from Bing search results, the pageview is recorded against cc.bingj.com. I've suggested that they add the original web site's URL (URL-encoded, of course) to the link. That way we can parse it out (similar to webcache.googleusercontent.com).

@robocoder commented on September 17th 2010 Contributor

(In [3161]) refs #1694 - add bing cache

@robocoder commented on September 17th 2010 Contributor

I'm thinking of adding a hook so plugins can implement their own search engine detection as there are requests for sites to be added that don't quite fit the traditional definition of a search engine.

@robocoder commented on September 20th 2010 Contributor

(In [3162]) refs #1694 - remove fix-up for webcache.googleusercontent.com; moving the logic to piwik.js

@robocoder commented on September 21st 2010 Contributor

(In [3163]) refs #739, refs #1694 - instead of a bing cache buster, detect when page is loaded from google or bing cache, and apply a fix to the url

@robocoder commented on September 21st 2010 Contributor

(In [3164]) refs #739, refs #1694 - fallback to the cache url if we can't parse it
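
The fallback behaviour can be sketched as follows, in Python rather than the piwik.js code the commits refer to. It assumes the cache link carries the original address in a query parameter, here the hypothetical `u`. The real cache URL formats may differ, and `cache_fixup` is an illustrative name.

```python
from urllib.parse import parse_qs, urlsplit

# Hosts named in this thread as serving cached copies.
CACHE_HOSTS = {"cc.bingj.com", "webcache.googleusercontent.com"}


def cache_fixup(url: str, param: str = "u") -> str:
    """Return the original page URL for a cache hit, else the URL unchanged."""
    parts = urlsplit(url)
    if parts.hostname not in CACHE_HOSTS:
        return url
    # `param` is a hypothetical parameter holding the original address.
    original = parse_qs(parts.query).get(param, [""])[0]
    if not original:
        return url  # could not parse it: fall back to the cache URL
    if "://" not in original:
        original = "http://" + original
    return original
```

Keeping this as a pure function of the URL string, with no access to the page or browser state, is also what makes it straightforward to unit test.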

@robocoder commented on September 21st 2010 Contributor

(In [3165]) refs #1694, refs #739 - make cacheFixup() testable

@robocoder commented on September 21st 2010 Contributor

Yahoo's Bing-powered search has an even weirder cache url.

@robocoder commented on September 21st 2010 Contributor

(In [3167]) refs #1694, refs #739 - Yahoo's cache result is served from Inktomi allocated ip addresses

@robocoder commented on September 21st 2010 Contributor

(In [3168]) fixes #1694 - misc fixes

  • if iconv fails, we use the original key
  • use mb_strtolower if available; apply this conversion after iconv
  • provide an icon when the referrer is www.google.com/cse
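
The first two fixes amount to an ordering rule: convert the charset first, keep the original key if conversion fails, and only then lowercase in a multibyte-aware way. A minimal Python sketch of that order follows. It is not the PHP code from the commit: `normalize_keyword` is a hypothetical name, and lowercasing the fallback key as well is an assumption.

```python
def normalize_keyword(raw: bytes, charset: str) -> str:
    """Decode `raw` from `charset`, fall back to the original, then lowercase."""
    try:
        text = raw.decode(charset)
    except (UnicodeDecodeError, LookupError):
        # Conversion failed (bad bytes or unknown charset): keep the
        # original key, interpreted as UTF-8 on a best-effort basis.
        text = raw.decode("utf-8", errors="replace")
    # Lowercase only after conversion, on text rather than raw bytes.
    return text.lower()
```

For example, `normalize_keyword("ÉCOLE".encode("latin-1"), "latin-1")` yields `école`, while an unknown charset name leaves an ASCII key such as `Piwik` intact apart from lowercasing.
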
@mattab commented on November 16th 2010 Member

Great work :) this will make maintenance a lot less tedious.

Is there a reason www.google.cat is still listed, or can it be removed?

@robocoder commented on November 16th 2010 Contributor

Technically, .cat isn't an ISO country code. But since I've already added the MaxMind codes to Countries.php, I guess it won't hurt to add this one too.

@robocoder commented on November 16th 2010 Contributor

(In [3319]) refs #1694 - treat .cat as a pseudo country tld

This Issue was closed on November 16th 2010