
Optimizations for import_logs.py #300

Merged
merged 12 commits into from Jun 7, 2014

Conversation

cbonte
Contributor

@cbonte cbonte commented May 29, 2014

This set of patches applies several optimizations to the log-analytics script.
Tests were run on an extract of 1 million lines from logs in NCSA format (dry run).

Before the optimizations:
Total time: 219 seconds
Requests imported per second: 4554.54 requests per second

After:
Total time: 72 seconds
Requests imported per second: 13717.18 requests per second

@cbonte cbonte changed the title Perf log analytics Optimizations for import_logs.py May 31, 2014
@@ -1704,13 +1723,11 @@ def invalid_line(line, reason):
invalid_line(line, 'invalid encoding')
continue

# Check if the hit must be excluded.
if all((method(hit) for method in self.check_methods)):
Member

Looks like this if statement that checks for valid requests was removed in your pull request; maybe this could explain some of the parsing CPU improvements?

@cbonte
Contributor Author

cbonte commented Jun 2, 2014

Hi, most of the CPU improvements come from the other commits.
For example, caching dates on a high-traffic website allowed us to double the lines per second; the other important commit is the use of set data structures instead of lists.

The diff you're quoting is from commit 63d000f. I may be missing something, but I think it's redundant with the same (inverted) check made before the date check.

@mattab
Member

mattab commented Jun 2, 2014

Ok you are right, your commit is valid, I missed the other redundant check!

@cbay
Contributor

cbay commented Jun 2, 2014

The rest of the pull request looks fine to me.

mattab added a commit that referenced this pull request Jun 3, 2014
@mattab
Member

mattab commented Jun 3, 2014

FYI, I've asked Travis if we can run the build on Python 2.6 instead of the current 2.7.3. I'll let you know when they reply.

@cbonte
Contributor Author

cbonte commented Jun 3, 2014

OK thanks, I think I still have some work to do in order not to break upgrades on servers with Python 2.6.
For Python 2.6, I use a third-party library: https://pypi.python.org/pypi/ordereddict

Maybe one solution would be to fall back to no cache if we're on Python 2.6 and the library is not installed, and add some documentation about that requirement (README.md + a non-fatal warning?)
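The fallback suggested above could look something like this. This is a sketch under the stated assumptions (the names `CACHE_ENABLED` and the warning text are illustrative, not the actual import_logs.py code): try the stdlib `OrderedDict` (Python 2.7+), then the third-party `ordereddict` package, and disable the cache with a non-fatal warning if neither is available.

```python
import sys

# Illustrative fallback: prefer the stdlib OrderedDict, then the PyPI
# "ordereddict" package (for Python 2.6), and otherwise run without
# the date cache instead of aborting the import.
try:
    from collections import OrderedDict
except ImportError:
    try:
        from ordereddict import OrderedDict  # https://pypi.python.org/pypi/ordereddict
    except ImportError:
        OrderedDict = None

CACHE_ENABLED = OrderedDict is not None
if not CACHE_ENABLED:
    sys.stderr.write(
        'WARNING: OrderedDict is not available; date caching is disabled. '
        'On Python 2.6, install the "ordereddict" package to enable it.\n'
    )
```

The import itself is cheap and happens once, so the check adds no per-line cost.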

@cbay
Contributor

cbay commented Jun 3, 2014

@cbonte Thanks for the details. What's the point of running a dry-run on a 100 million hits file, though? I'm genuinely intrigued by your use case.

Considering you have a very specific use case IMHO, we could choose not to merge this change BUT create a parse_date function that you could easily override locally. The import_logs script was specifically designed to be easy to override.

@mattab, what do you want to do?

PS: why concatenate the date and timezone to create the key rather than use a tuple?
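The tuple-versus-concatenation point can be illustrated briefly (variable names here are hypothetical). A concatenated string builds a new string per lookup and can be ambiguous if the date format ever changes, while a tuple is hashable, unambiguous, and compares component-wise.

```python
# Illustration of the suggested cache key change: use a tuple of the
# two components instead of concatenating them into one string.
date_string = '03/Jun/2014:12:00:00'
timezone = '+0200'

key_concat = date_string + timezone  # ambiguous where the date ends
key_tuple = (date_string, timezone)  # unambiguous composite key

cache = {}
cache[key_tuple] = 'parsed-date-placeholder'
```

Tuples of strings hash and compare efficiently, so there is no performance penalty for the clearer key.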

@mattab
Member

mattab commented Jun 3, 2014

I vote for removing the date optimization if it makes only a really small difference when not using --dry-run (especially if that helps it run on Python 2.6?)

@cbonte
Contributor Author

cbonte commented Jun 3, 2014

Honestly, in our usage, this is not a small difference. We are currently studying a migration from Urchin to Piwik for all of our websites, most of them generating more than 100 million hits per day (the logs I used were from a low-traffic day).
I'm quite sure we'll be able to migrate one day, but currently the difference is too large (daily logs are imported in 5 hours with Urchin, on a fairly old server that is not heavily tuned, compared to more than 2 days with Piwik).
I think that wherever we can save CPU cycles, we should try to optimize the process.

@cbonte
Contributor Author

cbonte commented Jun 3, 2014

@cbay dry-run is used to profile some websites with Piwik. We are currently in a study phase to migrate some high-traffic websites from Urchin to Piwik, and dry-run is heavily used to get quick statistics and compare them with Urchin's. In a migration process this is quite helpful, at least in our case; I hope it can be helpful to others and to the Piwik team in order to promote it.

@cbay
Contributor

cbay commented Jun 3, 2014

@cbonte I understand your use case, but since it's far from being typical IMHO and it adds complexity, I'm reluctant to merge it.

Regarding performance, have you tried PyPy?

@cbonte
Contributor Author

cbonte commented Jun 3, 2014

Hi again cbay,
I'm really not convinced we have an atypical use case; on the contrary, this will become more and more of a need for Piwik as it becomes more and more visible. I don't think there is any extra complexity, as soon as I fix the compatibility with the older mode (Python 2.6 standalone).

Concerning performance with PyPy, I didn't have time today to reproduce the benchmark on the same hardware, but launching it on an OpenVZ container with 4 vCPUs, with Debian Squeeze and Python 2.6.6 (the hardware is built on 2 physical Intel(R) Xeon(R) X5670 CPUs @ 2.93GHz, 6 cores / 12 threads each), I obtained these results:
cache: 871.344 ms
nocache: 28586.827 ms
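A comparison like the cache/nocache numbers above can be reproduced with a micro-benchmark along these lines. This is not cbonte's actual test harness, just an illustrative sketch; `parse_cached` and `SAMPLE` are made-up names.

```python
import timeit
from datetime import datetime

# Illustrative micro-benchmark: repeated parsing of one timestamp,
# with and without a memoizing cache in front of strptime.
SAMPLE = '03/Jun/2014:12:00:00'
FMT = '%d/%b/%Y:%H:%M:%S'

_cache = {}
def parse_cached(s):
    if s not in _cache:
        _cache[s] = datetime.strptime(s, FMT)
    return _cache[s]

uncached = timeit.timeit(lambda: datetime.strptime(SAMPLE, FMT), number=100000)
cached = timeit.timeit(lambda: parse_cached(SAMPLE), number=100000)
print('uncached: %.3f s, cached: %.3f s' % (uncached, cached))
```

On repeated timestamps the cached path reduces to a dict lookup, which is why the gap between the two numbers is so large.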

mattab added a commit that referenced this pull request Jun 4, 2014
…upported version

(best practice is to test at least the minimum supported version)

Refs #300
@mattab
Member

mattab commented Jun 4, 2014

@cbonte before you make the script compatible with Python 2.6, could you please re-merge with master and push a trivial change, just to trigger the build again? I expect the build to fail, since the code is not compatible with Python 2.6 and the builds should now use Python 2.6.

@cbonte
Contributor Author

cbonte commented Jun 5, 2014

@mattab I rebased the branch, which triggered a build (see commit 9b6b1f6). Then I could add the option to disable the cache for Python 2.6 without OrderedDict. I think it's ready now, feel free to review the code.
Let me know if you want any changes. Thanks ;-)

@mattab
Member

mattab commented Jun 6, 2014

Then I could add the option to disable the cache for Python 2.6 without OrderedDict

Would such an option have any benefit at all? We try to avoid adding options or settings unless they are absolutely necessary.

PR looks good to me
@cbay are you happy with it too?

@cbonte
Contributor Author

cbonte commented Jun 6, 2014

Hi, sorry, I didn't mean a new command-line option; it was about making the cache optional, not active if OrderedDict is not present, instead of raising an exception, as done in the latest commits.

@mattab
Member

mattab commented Jun 6, 2014

Do you know if some users don't have OrderedDict present in their python setup?

If so, then it would be best to make it optional (instead of raising an exception, silently ignore the cache and use no-cache).

@cbonte
Contributor Author

cbonte commented Jun 6, 2014

@mattab I'd say that most users with Python 2.6 are concerned; that's why I already made it optional (no exception is raised, but a message is displayed; I can make it completely silent if you prefer).

@mattab
Member

mattab commented Jun 6, 2014

Because it's optional and does not change anything for 99.99% of users, silence is definitely better 👍

check_methods are called twice for each hit. The first ones are sufficient to
decide whether the hit should be excluded or not.
cbay reported that set comprehensions are only available in Python 2.7+.
This patch fixes the syntax to keep backward compatibility with Python 2.6.
Fall back to non-cached dates when OrderedDict is not available.
This can occur with Python < 2.7 when the PyPI ordereddict package is not installed.
As suggested by cbay, the cache key can be a tuple instead of a string
concatenation.
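The exclusion check quoted in the diff can be sketched like this (the hit fields and check functions here are hypothetical, not the actual import_logs.py methods). Each check method returns True when the hit should be kept, and `all()` over a generator short-circuits on the first False, so every method runs at most once per hit.

```python
# Minimal sketch of the "check_methods called once" pattern: all()
# with a generator stops at the first failing check.
def check_static_file(hit):
    return not hit['path'].endswith(('.css', '.js', '.png'))

def check_http_error(hit):
    return hit['status'] < 400

check_methods = [check_static_file, check_http_error]

def is_hit_kept(hit):
    return all(method(hit) for method in check_methods)
```

Running the checks once and reusing the verdict avoids the duplicated work the first commit message describes.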
@cbonte
Contributor Author

cbonte commented Jun 6, 2014

Modification done, and I rebased the branch from master, hoping it would fix the Travis tests, but they still fail and I see that master is also failing.
I'll rebase one more time once the tests pass on master.

mattab pushed a commit that referenced this pull request Jun 7, 2014
Optimizations for import_logs.py Fixes #5314

Kudos for the nice pull request! We hope to see more contributions from you in the future :)
@mattab mattab merged commit e2808bd into matomo-org:master Jun 7, 2014
mattab added a commit that referenced this pull request Aug 13, 2014
…pageviews to be lost when importing big log files.

This particular log file I'm testing on is for an intranet with the same IP address appearing thousands of times. Not sure if it's related, but the same IP address will have many visits at the same second, for different users (different _id=X in the piwik.php requests)
refs #300
mattab added a commit to matomo-org/matomo-log-analytics that referenced this pull request Mar 17, 2015
…pageviews to be lost when importing big log files.

This particular log file I'm testing on is for an intranet with the same IP address appearing thousands of times. Not sure if it's related, but the same IP address will have many visits at the same second, for different users (different _id=X in the piwik.php requests)
refs matomo-org/matomo#300