@cbonte opened this Pull Request on May 29th 2014 Contributor

This set of patches addresses some optimizations in the log-analytics script.
Tests were made on an extract of 1 million lines from logs in ncsa format (dry-run).

Before the optimizations :
Total time: 219 seconds
Requests imported per second: 4554.54 requests per second

After :
Total time: 72 seconds
Requests imported per second: 13717.18 requests per second

@cbonte commented on June 2nd 2014 Contributor

Hi, most of the CPU improvements comes from the other commits.
For example, caching dates on a high traffic website allowed to double the lines per second, the other important commit it the use of set data structures instead of lists.

The diff you're quoting is from commit 63d000f, I may miss something but I think it's redundant with the same (inverted) check made before the date check.

@mattab commented on June 2nd 2014 Member

Ok you are right, your commit is valid, I missed the other redundant check!

@cbay commented on June 2nd 2014 Contributor

The rest of the pull request looks fine for me.

@mattab commented on June 3rd 2014 Member

FYI I've asked Travis if we can run the build on Python 2.6 instead of current 2.7.3. Will let you know when they reply....

@cbonte commented on June 3rd 2014 Contributor

OK thanks, I think I have still some work to do in order to not break upgrade on servers with python 2.6.
For python 2.6, I use a 3rd party library : https://pypi.python.org/pypi/ordereddict

Maybe one solution would be to fallback to no cache if we're in python 2.6 and the library is not installed, and add some documentation about that requirement (README.md + a non fatal warning ?)

@cbay commented on June 3rd 2014 Contributor

@cbonte Thanks for the details. What's the point of running a dry-run on a 100 million hits file, though? I'm genuinely intrigued for your use-case.

Considering you have a very specific use-case IMHO, we could not merge this change BUT create a parse_date function that you could easily override locally. The import_logs scripts was specifically designed to be easy to override.

@mattab, what do you want to do?

PS: why concatenate the date and timezone to create the key rather than use a tuple?

@mattab commented on June 3rd 2014 Member

I vote for removing the date optimization, if it's really small difference when not using --dry-run (especially if it helps make it run on Python 2.6?)

@cbonte commented on June 3rd 2014 Contributor

Honestly, in our usage, this is not a small difference. We are currently studying the migration from urchin to piwik for all of our websites, most of them generating more than 100 millions hits per day (the logs I used concerned a low traffic day).
I'm quite sure we'll be able to migrate one day, but currently the difference is too large (daily logs are integrated in 5h whith urchin, on a quite old server not heavily tuned, compared to more than 2 days with piwik).
I think that everywhere we can save cpu cycles, we should try to optimize the process.

@cbonte commented on June 3rd 2014 Contributor

@cbay dry-run is used on some websites to profile them with piwik. We are currently in a study phase to migrate some high traffic website from urchin to piwik. dry-run is heavily used to have some quick statistics and compare with urchin ones. In a migration process, this is quite helpful, at least in our case, I hope it can be helpful to others and to the piwik team in order to promote it.

@cbay commented on June 3rd 2014 Contributor

@cbonte I understand your use case, but since it's far from being typical IMHO and it adds complexity, I'm reluctant to merge it.

Regarding performance, have you tried PyPy?

@cbonte commented on June 3rd 2014 Contributor

Hi again cbay,
I'm really not convinced we are in an untypical use case, but in contrary this will become more and more a need for piwik, as it is more and more visible. I don't think there is any over complexity, as soon as I fix the compatilibity with the older mode (Pyhon 2.6 standalone).

Concerning performance with PyPi, I didn't have time today to reprocuce it on the same hardware, but launching the bench on a openvz container with 4 vcpus, with a Debian Squeeze and Python 2.6.6 (hardware is built on 2 physical Intel(R) Xeon(R) CPU X5670 @ 2.93GHz / 6 cores / 12 threads), i could obtain such results :
cache : 871.344 ms
nocache : 28586.827 ms

@mattab commented on June 4th 2014 Member

@cbonte before you make the script compatible with Python 2.6, could you please re-merge with master, and push a trivial change, just to trigger the build again ? I expect that the build should fail, since the code is not compatible with Python 2.6 and the builds should now use python 2.6

@cbonte commented on June 5th 2014 Contributor

@mattab I rebased the branch, which triggered a build (see commit 9b6b1f6). Then I could add the option do disable the cache for Python 2.6 without OrderedDict, I think it's ready now, feel free to review the code.
Let me know if you want some changes. Thanks ;-)

@mattab commented on June 6th 2014 Member

Then I could add the option do disable the cache for Python 2.6 without OrderedDict

Would such option have any benefit at all? we try to avoid adding option or setting unless they are absolutely necessary.

PR looks good to me
@cbay are you happy with it too?

@cbonte commented on June 6th 2014 Contributor

hi, sorry I didn't mean a new command line option, it was about setting the cache optional, not active if OrderedDict is not present instead of raising an exception, as it's done in the latest commits.

@mattab commented on June 6th 2014 Member

Do you know if some users don't have OrderedDict present in their python setup?

If so, then it would be best to make it optional (instead of raising exception, silently ignore the cache and use no-cache).

@cbonte commented on June 6th 2014 Contributor

@mattab I'd say that most users with python 2.6 are concerned, that's why i already made it optional (no exception raised but a message is displayed : I can make it completely silent if you prefer).

@mattab commented on June 6th 2014 Member

because it's optional and does not change anything for 99.99% of users, Silence is definitely better :+1:

@cbonte commented on June 6th 2014 Contributor

Modification done and I rebased the branch from master, hoping it would fix the Travis tests but it still fails and I see that master is also failing.
I'll rebase one more time once tests pass on master.

This Pull Request was closed on June 7th 2014
Powered by GitHub Issue Mirror