This set of patches addresses some optimizations in the log-analytics script.
Tests were made on an extract of 1 million lines from logs in ncsa format (dry-run).
Before the optimizations :
Total time: 219 seconds
Requests imported per second: 4554.54 requests per second
Total time: 72 seconds
Requests imported per second: 13717.18 requests per second
Hi, most of the CPU improvements comes from the other commits.
For example, caching dates on a high traffic website allowed to double the lines per second, the other important commit it the use of set data structures instead of lists.
The diff you're quoting is from commit 63d000f, I may miss something but I think it's redundant with the same (inverted) check made before the date check.
Ok you are right, your commit is valid, I missed the other redundant check!
The rest of the pull request looks fine for me.
FYI I've asked Travis if we can run the build on Python 2.6 instead of current 2.7.3. Will let you know when they reply....
OK thanks, I think I have still some work to do in order to not break upgrade on servers with python 2.6.
For python 2.6, I use a 3rd party library : https://pypi.python.org/pypi/ordereddict
Maybe one solution would be to fallback to no cache if we're in python 2.6 and the library is not installed, and add some documentation about that requirement (README.md + a non fatal warning ?)
@cbonte Thanks for the details. What's the point of running a dry-run on a 100 million hits file, though? I'm genuinely intrigued for your use-case.
Considering you have a very specific use-case IMHO, we could not merge this change BUT create a parse_date function that you could easily override locally. The import_logs scripts was specifically designed to be easy to override.
@mattab, what do you want to do?
PS: why concatenate the date and timezone to create the key rather than use a tuple?
I vote for removing the date optimization, if it's really small difference when not using --dry-run (especially if it helps make it run on Python 2.6?)
Honestly, in our usage, this is not a small difference. We are currently studying the migration from urchin to piwik for all of our websites, most of them generating more than 100 millions hits per day (the logs I used concerned a low traffic day).
I'm quite sure we'll be able to migrate one day, but currently the difference is too large (daily logs are integrated in 5h whith urchin, on a quite old server not heavily tuned, compared to more than 2 days with piwik).
I think that everywhere we can save cpu cycles, we should try to optimize the process.
@cbay dry-run is used on some websites to profile them with piwik. We are currently in a study phase to migrate some high traffic website from urchin to piwik. dry-run is heavily used to have some quick statistics and compare with urchin ones. In a migration process, this is quite helpful, at least in our case, I hope it can be helpful to others and to the piwik team in order to promote it.
@cbonte I understand your use case, but since it's far from being typical IMHO and it adds complexity, I'm reluctant to merge it.
Regarding performance, have you tried PyPy?
Hi again cbay,
I'm really not convinced we are in an untypical use case, but in contrary this will become more and more a need for piwik, as it is more and more visible. I don't think there is any over complexity, as soon as I fix the compatilibity with the older mode (Pyhon 2.6 standalone).
Concerning performance with PyPi, I didn't have time today to reprocuce it on the same hardware, but launching the bench on a openvz container with 4 vcpus, with a Debian Squeeze and Python 2.6.6 (hardware is built on 2 physical Intel(R) Xeon(R) CPU X5670 @ 2.93GHz / 6 cores / 12 threads), i could obtain such results :
cache : 871.344 ms
nocache : 28586.827 ms
@cbonte before you make the script compatible with Python 2.6, could you please re-merge with master, and push a trivial change, just to trigger the build again ? I expect that the build should fail, since the code is not compatible with Python 2.6 and the builds should now use python 2.6
@mattab I rebased the branch, which triggered a build (see commit 9b6b1f6). Then I could add the option do disable the cache for Python 2.6 without OrderedDict, I think it's ready now, feel free to review the code.
Let me know if you want some changes. Thanks ;-)
Then I could add the option do disable the cache for Python 2.6 without OrderedDict
Would such option have any benefit at all? we try to avoid adding option or setting unless they are absolutely necessary.
PR looks good to me
@cbay are you happy with it too?
hi, sorry I didn't mean a new command line option, it was about setting the cache optional, not active if OrderedDict is not present instead of raising an exception, as it's done in the latest commits.
Do you know if some users don't have OrderedDict present in their python setup?
If so, then it would be best to make it optional (instead of raising exception, silently ignore the cache and use no-cache).
@mattab I'd say that most users with python 2.6 are concerned, that's why i already made it optional (no exception raised but a message is displayed : I can make it completely silent if you prefer).
because it's optional and does not change anything for 99.99% of users, Silence is definitely better :+1:
Modification done and I rebased the branch from master, hoping it would fix the Travis tests but it still fails and I see that master is also failing.
I'll rebase one more time once tests pass on master.