@halabuda opened this Pull Request on August 17th 2015

hello hello! the case sensitivity of the the Page URL segment type was not an issue with the field itself, which actually is of type utf8_general_ci, but rather with the CRC32 hashing of the field. upper/lower case variations of the same fundamental URL will produce unique hashes. this is a simple conditional fix to ignore the hash field when retrieving records for the Page URL segment type.

@mattab commented on August 17th 2015 Owner

Hi @halabuda - thanks for the PR.

Fyi, we can't merge this because of performance regression. there is no index on the field name in log_action so this change would result in performance slowdown. Only index on the log_action table is: INDEX index_type_hash (type, hashs)

I'm not sure how we can fix #8579

edit: maybe we could create the hash based of the lowercase string?

@halabuda commented on August 17th 2015

ah yes i see. i thought about fixing it that way initially but my concern was with preserving the hash field entropy for all the use cases when a case sensitive lookup was desirable. but forcing the hash to lower on insert/select makes sense to me if you don't think my aforementioned concern is a problem. let me know and i would be glad to spend some time on this today.

@tsteur commented on August 18th 2015 Owner

It would be nice if we did not have to create the hash based on lowercase as there will be no way back. I mean once we change the URL to lowercase we won't be possible to get the originally used URL at some point. Eg in the UI we would display the URL lowercase as well but this might not be what everyone wants.

Also I think we would possibly create many duplicated actions and would have to modify some reports maybe to merge duplicated urls to one url. Otherwise in the report there might appear eg a /Test and a /test when requesting eg week, month, year, range and on the day of the update.

Also not sure how this would affect label feature etc. Would be nice to find a solution that only modifies the segment query but that's probably not possible. Not sure how to fix

@halabuda commented on August 18th 2015

@tsteur i agree with you. i don't like the "no way back" aspect either.

i am confused regarding your comment about duplicated actions. with case sensitive tracking, as it works currently, a new log_action entry is created for each instance of /foo /Foo and /FOO. is this what you mean by "duplicated actions"? forcing the URL to lower before insertion would actually consolidate log_action records. whether or not this is desirable from a product design perspective i do not know.

@tsteur commented on August 18th 2015 Owner

a new log_action entry is created for each instance of /foo /Foo and /FOO. is this what you mean by "duplicated actions"

Yes, I mean in the log_action table that contains the name of the URL and the hash. I think we usually check whether a given hash already exists and if not, insert a new action

@halabuda commented on August 18th 2015

right, so given my example, 3 unique log_action entries would be created for the same effective URL of /foo because each case sensitive hash is unique, whereas a forced lower hash would consolidate all 3 URLs into one record because they would share a hash value. but which is the desired functionality?

what about the possibility of creating a secondary hash/index (ie. log_action.hash_ci)? actions would still be stored against the original hash along with their unmodified URL string, so case sensitivity would be preserved where desired, but then the hash_ci field could be used in certain instances where consolidation is necessary.

@tsteur commented on August 18th 2015 Owner

I guess to have only one is kinda desired functionality but it might be problematic as old data is not stored lowercase. This would lead in having duplicated actions which can be bad for performance and we might display wrong data. This needs to be tested. Nonetheless we currently might have already different versions of the same URL in the log_action table. I haven't had a chance to have a look at the code yet.

Maybe the hash could be based on lowercase but we store the actually sent URL (how it was sent the first time) to keep having upper case letters in the UI. I wouldn't really feel confident about storing all urls lowercase. In theory one could even serve a different page depending on lower/upper case characters.

hash_ci could be a solution but would need a lot of time to regenerate the index for all actions, the index size of the DB would get bigger etc. It's probably not realistic to do it

@tsteur commented on August 18th 2015 Owner

just to confirm I had a look at the code and tested sending eg 3 different versions of the same URL creates 3 actions in the log_action table which is also kinda right as eg case sensitivity in URL parameters etc could be important I think

@mattab commented on September 8th 2015 Owner

Thanks for the PR but we can't merge it as is. We continue discussion in #8579

This Pull Request was closed on September 8th 2015
Powered by GitHub Issue Mirror