Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ensure case insensitivity of the the Page URL segment type #8584

Closed
wants to merge 6 commits into from
Closed

ensure case insensitivity of the the Page URL segment type #8584

wants to merge 6 commits into from

Conversation

halabuda
Copy link

hello hello! the case sensitivity of the the Page URL segment type was not an issue with the field itself, which actually is of type utf8_general_ci, but rather with the CRC32 hashing of the field. upper/lower case variations of the same fundamental URL will produce unique hashes. this is a simple conditional fix to ignore the hash field when retrieving records for the Page URL segment type.

@halabuda halabuda changed the title fix for issue #8579 ensure case insensitivity of the the Page URL segment type Aug 17, 2015
@mattab
Copy link
Member

mattab commented Aug 17, 2015

Hi @halabuda - thanks for the PR.

Fyi, we can't merge this because of performance regression. there is no index on the field name in log_action so this change would result in performance slowdown. Only index on the log_action table is: INDEX index_type_hash (type, hashs)

I'm not sure how we can fix #8579

edit: maybe we could create the hash based of the lowercase string?

@halabuda
Copy link
Author

ah yes i see. i thought about fixing it that way initially but my concern was with preserving the hash field entropy for all the use cases when a case sensitive lookup was desirable. but forcing the hash to lower on insert/select makes sense to me if you don't think my aforementioned concern is a problem. let me know and i would be glad to spend some time on this today.

@tsteur
Copy link
Member

tsteur commented Aug 18, 2015

It would be nice if we did not have to create the hash based on lowercase as there will be no way back. I mean once we change the URL to lowercase we won't be possible to get the originally used URL at some point. Eg in the UI we would display the URL lowercase as well but this might not be what everyone wants.

Also I think we would possibly create many duplicated actions and would have to modify some reports maybe to merge duplicated urls to one url. Otherwise in the report there might appear eg a /Test and a /test when requesting eg week, month, year, range and on the day of the update.

Also not sure how this would affect label feature etc. Would be nice to find a solution that only modifies the segment query but that's probably not possible. Not sure how to fix

@halabuda
Copy link
Author

@tsteur i agree with you. i don't like the "no way back" aspect either.

i am confused regarding your comment about duplicated actions. with case sensitive tracking, as it works currently, a new log_action entry is created for each instance of /foo /Foo and /FOO. is this what you mean by "duplicated actions"? forcing the URL to lower before insertion would actually consolidate log_action records. whether or not this is desirable from a product design perspective i do not know.

@tsteur
Copy link
Member

tsteur commented Aug 18, 2015

a new log_action entry is created for each instance of /foo /Foo and /FOO. is this what you mean by "duplicated actions"

Yes, I mean in the log_action table that contains the name of the URL and the hash. I think we usually check whether a given hash already exists and if not, insert a new action

@halabuda
Copy link
Author

right, so given my example, 3 unique log_action entries would be created for the same effective URL of /foo because each case sensitive hash is unique, whereas a forced lower hash would consolidate all 3 URLs into one record because they would share a hash value. but which is the desired functionality?

what about the possibility of creating a secondary hash/index (ie. log_action.hash_ci)? actions would still be stored against the original hash along with their unmodified URL string, so case sensitivity would be preserved where desired, but then the hash_ci field could be used in certain instances where consolidation is necessary.

@tsteur
Copy link
Member

tsteur commented Aug 18, 2015

I guess to have only one is kinda desired functionality but it might be problematic as old data is not stored lowercase. This would lead in having duplicated actions which can be bad for performance and we might display wrong data. This needs to be tested. Nonetheless we currently might have already different versions of the same URL in the log_action table. I haven't had a chance to have a look at the code yet.

Maybe the hash could be based on lowercase but we store the actually sent URL (how it was sent the first time) to keep having upper case letters in the UI. I wouldn't really feel confident about storing all urls lowercase. In theory one could even serve a different page depending on lower/upper case characters.

hash_ci could be a solution but would need a lot of time to regenerate the index for all actions, the index size of the DB would get bigger etc. It's probably not realistic to do it

@tsteur
Copy link
Member

tsteur commented Aug 18, 2015

just to confirm I had a look at the code and tested sending eg 3 different versions of the same URL creates 3 actions in the log_action table which is also kinda right as eg case sensitivity in URL parameters etc could be important I think

@mattab
Copy link
Member

mattab commented Sep 8, 2015

Thanks for the PR but we can't merge it as is. We continue discussion in #8579

@mattab mattab closed this Sep 8, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants