
Optimizations for import_logs.py #300

Merged
merged 12 commits into from Jun 7, 2014

Conversation

cbonte
Contributor

@cbonte cbonte commented May 29, 2014

This set of patches applies several optimizations to the log-analytics script.
Tests were run on an extract of 1 million lines from logs in NCSA format (dry run).

Before the optimizations:
Total time: 219 seconds
Requests imported per second: 4554.54 requests per second

After:
Total time: 72 seconds
Requests imported per second: 13717.18 requests per second

@cbonte cbonte changed the title Perf log analytics Optimizations for import_logs.py May 31, 2014
@@ -1704,13 +1723,11 @@ def invalid_line(line, reason):
invalid_line(line, 'invalid encoding')
continue

# Check if the hit must be excluded.
if all((method(hit) for method in self.check_methods)):
Member

Looks like this if statement that checks for valid requests was removed in your pull request; maybe this could explain some of the parsing CPU improvements?

@cbonte
Contributor Author

cbonte commented Jun 2, 2014

Hi, most of the CPU improvements come from the other commits.
For example, caching dates on a high-traffic website allowed us to double the lines per second; the other important commit is the use of set data structures instead of lists.

The diff you're quoting is from commit 63d000f. I may be missing something, but I think it's redundant with the same (inverted) check made before the date check.

@mattab
Member

mattab commented Jun 2, 2014

Ok you are right, your commit is valid, I missed the other redundant check!

@cbay
Contributor

cbay commented Jun 2, 2014

The rest of the pull request looks fine to me.

mattab added a commit that referenced this pull request Jun 3, 2014
@mattab
Member

mattab commented Jun 3, 2014

FYI, I've asked Travis if we can run the build on Python 2.6 instead of the current 2.7.3. I'll let you know when they reply.

@cbonte
Contributor Author

cbonte commented Jun 3, 2014

OK thanks, I think I still have some work to do in order not to break upgrades on servers with Python 2.6.
For Python 2.6, I use a third-party library: https://pypi.python.org/pypi/ordereddict

Maybe one solution would be to fall back to no cache if we're on Python 2.6 and the library is not installed, and add some documentation about that requirement (README.md + a non-fatal warning?)
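The fallback suggested above could look something like this. This is a sketch under the stated assumptions (the names `CACHE_ENABLED` and the warning text are illustrative, not the actual import_logs.py code): try the stdlib `OrderedDict` (Python 2.7+), then the third-party `ordereddict` package, and disable the cache with a non-fatal warning if neither is available.

```python
import sys

# Illustrative fallback: prefer the stdlib OrderedDict, then the PyPI
# "ordereddict" package (for Python 2.6), and otherwise run without
# the date cache instead of aborting the import.
try:
    from collections import OrderedDict
except ImportError:
    try:
        from ordereddict import OrderedDict  # https://pypi.python.org/pypi/ordereddict
    except ImportError:
        OrderedDict = None

CACHE_ENABLED = OrderedDict is not None
if not CACHE_ENABLED:
    sys.stderr.write(
        'WARNING: OrderedDict is not available; date caching is disabled. '
        'On Python 2.6, install the "ordereddict" package to enable it.\n'
    )
```

The import itself is cheap and happens once, so the check adds no per-line cost.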

@cbay
Contributor

cbay commented Jun 3, 2014

@cbonte Thanks for the details. What's the point of running a dry-run on a 100 million hits file, though? I'm genuinely intrigued by your use case.

Considering you have a very specific use case IMHO, we could choose not to merge this change BUT create a parse_date function that you could easily override locally. The import_logs script was specifically designed to be easy to override.

@mattab, what do you want to do?

PS: why concatenate the date and timezone to create the key rather than use a tuple?
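The tuple-versus-concatenation point can be illustrated briefly (variable names here are hypothetical). A concatenated string builds a new string per lookup and can be ambiguous if the date format ever changes, while a tuple is hashable, unambiguous, and compares component-wise.

```python
# Illustration of the suggested cache key change: use a tuple of the
# two components instead of concatenating them into one string.
date_string = '03/Jun/2014:12:00:00'
timezone = '+0200'

key_concat = date_string + timezone  # ambiguous where the date ends
key_tuple = (date_string, timezone)  # unambiguous composite key

cache = {}
cache[key_tuple] = 'parsed-date-placeholder'
```

Tuples of strings hash and compare efficiently, so there is no performance penalty for the clearer key.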

@mattab
Member

mattab commented Jun 3, 2014

I vote for removing the date optimization if it makes only a really small difference when not using --dry-run (especially if that helps it run on Python 2.6?)

@cbonte
Contributor Author

cbonte commented Jun 3, 2014

Honestly, in our usage, this is not a small difference. We are currently studying a migration from Urchin to Piwik for all of our websites, most of them generating more than 100 million hits per day (the logs I used were from a low-traffic day).
I'm quite sure we'll be able to migrate one day, but currently the difference is too large (daily logs are imported in 5 hours with Urchin, on a fairly old server that is not heavily tuned, compared to more than 2 days with Piwik).
I think that wherever we can save CPU cycles, we should try to optimize the process.

@cbonte
Contributor Author

cbonte commented Jun 3, 2014

@cbay dry-run is used to profile some websites with Piwik. We are currently in a study phase to migrate some high-traffic websites from Urchin to Piwik, and dry-run is heavily used to get quick statistics and compare them with Urchin's. In a migration process this is quite helpful, at least in our case; I hope it can be helpful to others and to the Piwik team in order to promote it.

@cbay
Contributor

cbay commented Jun 3, 2014

@cbonte I understand your use case, but since it's far from being typical IMHO and it adds complexity, I'm reluctant to merge it.

Regarding performance, have you tried PyPy?

@cbonte
Contributor Author

cbonte commented Jun 3, 2014

Hi again cbay,
I'm really not convinced we have an atypical use case; on the contrary, this will become more and more of a need for Piwik as it becomes more and more visible. I don't think there is any extra complexity, as soon as I fix the compatibility with the older mode (Python 2.6 standalone).

Concerning performance with PyPy, I didn't have time today to reproduce the benchmark on the same hardware, but launching it on an OpenVZ container with 4 vCPUs, with Debian Squeeze and Python 2.6.6 (the hardware is built on 2 physical Intel(R) Xeon(R) X5670 CPUs @ 2.93GHz, 6 cores / 12 threads each), I obtained these results:
cache: 871.344 ms
nocache: 28586.827 ms
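A comparison like the cache/nocache numbers above can be reproduced with a micro-benchmark along these lines. This is not cbonte's actual test harness, just an illustrative sketch; `parse_cached` and `SAMPLE` are made-up names.

```python
import timeit
from datetime import datetime

# Illustrative micro-benchmark: repeated parsing of one timestamp,
# with and without a memoizing cache in front of strptime.
SAMPLE = '03/Jun/2014:12:00:00'
FMT = '%d/%b/%Y:%H:%M:%S'

_cache = {}
def parse_cached(s):
    if s not in _cache:
        _cache[s] = datetime.strptime(s, FMT)
    return _cache[s]

uncached = timeit.timeit(lambda: datetime.strptime(SAMPLE, FMT), number=100000)
cached = timeit.timeit(lambda: parse_cached(SAMPLE), number=100000)
print('uncached: %.3f s, cached: %.3f s' % (uncached, cached))
```

On repeated timestamps the cached path reduces to a dict lookup, which is why the gap between the two numbers is so large.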

mattab added a commit that referenced this pull request Jun 4, 2014
…upported version

(best practice is to test at least the minimum supported version)

Refs #300
@mattab
Member

mattab commented Jun 4, 2014

@cbonte before you make the script compatible with Python 2.6, could you please re-merge with master and push a trivial change, just to trigger the build again? I expect the build to fail, since the code is not compatible with Python 2.6 and the builds should now use Python 2.6.

@cbonte
Contributor Author

cbonte commented Jun 5, 2014

@mattab I rebased the branch, which triggered a build (see commit 9b6b1f6). Then I could add the option to disable the cache for Python 2.6 without OrderedDict. I think it's ready now, feel free to review the code.
Let me know if you want any changes. Thanks ;-)

@mattab
Member

mattab commented Jun 6, 2014

Then I could add the option to disable the cache for Python 2.6 without OrderedDict

Would such an option have any benefit at all? We try to avoid adding options or settings unless they are absolutely necessary.

PR looks good to me
@cbay are you happy with it too?

@cbonte
Contributor Author

cbonte commented Jun 6, 2014

Hi, sorry, I didn't mean a new command-line option; it was about making the cache optional, not active if OrderedDict is not present, instead of raising an exception, as done in the latest commits.

@mattab
Member

mattab commented Jun 6, 2014

Do you know if some users don't have OrderedDict present in their python setup?

If so, then it would be best to make it optional (instead of raising an exception, silently ignore the cache and use no-cache).

@cbonte
Contributor Author

cbonte commented Jun 6, 2014

@mattab I'd say that most users with Python 2.6 are concerned; that's why I already made it optional (no exception is raised, but a message is displayed; I can make it completely silent if you prefer).

@mattab
Member

mattab commented Jun 6, 2014

Because it's optional and does not change anything for 99.99% of users, silence is definitely better 👍

check_methods are called twice for each hit. The first ones are sufficient to
decide whether the hit should be excluded or not.
cbay reported that set comprehensions are only available in Python 2.7+.
This patch fixes the syntax to keep backward compatibility with Python 2.6.
Fall back to non-cached dates when OrderedDict is not available.
This can occur with Python < 2.7 when the PyPI ordereddict package is not installed.
As suggested by cbay, the cache key can be a tuple instead of a string
concatenation.
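The exclusion check quoted in the diff can be sketched like this (the hit fields and check functions here are hypothetical, not the actual import_logs.py methods). Each check method returns True when the hit should be kept, and `all()` over a generator short-circuits on the first False, so every method runs at most once per hit.

```python
# Minimal sketch of the "check_methods called once" pattern: all()
# with a generator stops at the first failing check.
def check_static_file(hit):
    return not hit['path'].endswith(('.css', '.js', '.png'))

def check_http_error(hit):
    return hit['status'] < 400

check_methods = [check_static_file, check_http_error]

def is_hit_kept(hit):
    return all(method(hit) for method in check_methods)
```

Running the checks once and reusing the verdict avoids the duplicated work the first commit message describes.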
@cbonte
Contributor Author

cbonte commented Jun 6, 2014

Modification done, and I rebased the branch from master, hoping it would fix the Travis tests, but they still fail and I see that master is also failing.
I'll rebase one more time once the tests pass on master.

mattab pushed a commit that referenced this pull request Jun 7, 2014
Optimizations for import_logs.py Fixes #5314

Kudos for the nice pull request! We hope to see more contributions from you in the future :)
@mattab mattab merged commit e2808bd into matomo-org:master Jun 7, 2014
mattab added a commit that referenced this pull request Aug 13, 2014
…pageviews to be lost when importing big log files.

This particular log file I'm testing on is for an intranet with the same IP address appearing thousands of times. Not sure if it's related, but the same IP address will have many visits at the same second, for different users (different _id=X in the piwik.php requests)
refs #300
mattab added a commit to matomo-org/matomo-log-analytics that referenced this pull request Mar 17, 2015
…pageviews to be lost when importing big log files.

This particular log file I'm testing on is for an intranet with the same IP address appearing thousands of times. Not sure if it's related, but the same IP address will have many visits at the same second, for different users (different _id=X in the piwik.php requests)
refs matomo-org/matomo#300