
log import hangs in a non-deterministic way #4472

Closed
anonymous-matomo-user opened this issue Jan 3, 2014 · 20 comments
Labels
duplicate: For issues that already existed in our issue tracker and were reported previously.
Task: Indicates an issue is neither a feature nor a bug and it's purely a "technical" change.

Comments

@anonymous-matomo-user

In my setup I use the log importer to load data into Piwik. Everything seems to go well, except that sometimes (and it doesn't seem to be repeatable) one of the recorder threads hangs and the rest of the threads wait on a futex.

For example - main process:

strace -p 24643

Process 24643 attached - interrupt to quit
select(0, NULL, NULL, NULL, {0, 819826}) = 0 (Timeout)
gettimeofday({1388734405, 12820}, NULL) = 0
select(0, NULL, NULL, NULL, {1, 0}^C <unfinished ...>
Process 24643 detached

Hung thread:

strace -p 24646

Process 24646 attached - interrupt to quit
recvmsg(6, ^C <unfinished ...>
Process 24646 detached

Rest of the threads (all look alike):

strace -p 24645

Process 24645 attached - interrupt to quit
futex(0x7f8028001540, FUTEX_WAIT_PRIVATE, 0, NULL^C <unfinished ...>
Process 24645 detached

I added a thread dump capability to import_logs.py, and this is what I got.
One of the waiting threads (they are all identical):

Thread: Thread-13(139637655840512)

File: "/usr/lib64/python2.6/threading.py", line 504, in __bootstrap
self.__bootstrap_inner()
File: "/usr/lib64/python2.6/threading.py", line 532, in __bootstrap_inner
self.run()
File: "/usr/lib64/python2.6/threading.py", line 484, in run
self.__target(*self.__args, **self.__kwargs)
File: "/SVN/scripts/piwik/import_logs.py", line 1167, in _run_bulk
hits = self.queue.get()
File: "/usr/lib64/python2.6/Queue.py", line 168, in get
self.not_empty.wait()
File: "/usr/lib64/python2.6/threading.py", line 239, in wait
waiter.acquire()

Hung thread (I assume, since it's the only one that's different):

Thread: Thread-7(139638058493696)

File: "/usr/lib64/python2.6/threading.py", line 504, in __bootstrap
self.__bootstrap_inner()
File: "/usr/lib64/python2.6/threading.py", line 532, in __bootstrap_inner
self.run()
File: "/usr/lib64/python2.6/threading.py", line 484, in run
self.__target(*self.__args, **self.__kwargs)
File: "/SVN/scripts/piwik/import_logs.py", line 1170, in _run_bulk
self._record_hits(hits)

Main process:

Thread: Thread-1(139638191130368)

File: "/usr/lib64/python2.6/threading.py", line 504, in __bootstrap
self.__bootstrap_inner()
File: "/usr/lib64/python2.6/threading.py", line 532, in __bootstrap_inner
self.run()
File: "/usr/lib64/python2.6/threading.py", line 484, in run
self.__target(*self.__args, **self.__kwargs)
File: "/SVN/scripts/piwik/import_logs.py", line 820, in _monitor
time.sleep(config.options.show_progress_delay)
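
For reference, the dump above comes from a small hook I added. A minimal sketch of such a thread-dump hook (SIGUSR1 is just my choice here; this is not part of the stock import_logs.py) looks roughly like this:

import signal
import sys
import threading
import traceback

def dump_threads(signum, frame):
    # Map thread idents back to Thread objects so the dump shows names.
    names = dict((t.ident, t.name) for t in threading.enumerate())
    for ident, stack in sys._current_frames().items():
        print "Thread: %s(%d)" % (names.get(ident, "?"), ident)
        for filename, lineno, func, line in traceback.extract_stack(stack):
            print 'File: "%s", line %d, in %s' % (filename, lineno, func)
            if line:
                print "  " + line.strip()

# Install the handler before the recorder threads are started.
signal.signal(signal.SIGUSR1, dump_threads)

With something like this in place, sending kill -USR1 to the hung importer prints every thread's stack.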

As far as I know it used to occur in 1.12 (although less frequently, I think) and it occurs in 2.0.2 as well.

My installation runs on CentOS 6.4:

rpm -qa | grep python

python-ethtool-0.6-3.el6.x86_64
python-libs-2.6.6-37.el6_4.x86_64
python-setuptools-0.6.10-3.el6.noarch
python-devel-2.6.6-37.el6_4.x86_64
python-iniparse-0.3.1-2.1.el6.noarch
python-dateutil-1.4.1-6.el6.noarch
python-urlgrabber-3.9.1-8.el6.noarch
python-pycurl-7.19.0-8.el6.x86_64
rpm-python-4.8.0-32.el6.x86_64
python-2.6.6-37.el6_4.x86_64
python-pip-1.3.1-4.el6.noarch
newt-python-0.52.11-3.el6.x86_64
libxml2-python-2.7.6-12.el6_4.1.x86_64
libproxy-python-0.3.0-4.el6_3.x86_64

Keywords: log import

@anonymous-matomo-user
Author

One more thing: it seems that the hang always occurs after the whole file has been read (but before all of it has been recorded), or very close to the end.

@mattab
Member

mattab commented Jan 5, 2014

Thanks for the detailed report. Maybe other users experience the issue as well. If you find the bug in the code, please comment here or open a pull request. Happy new year!

@anonymous-matomo-user
Author

After a few more occurrences, it seems I was wrong that it happened only at the end of the input file; that is not the case. It hangs in the middle of the log as well.
It also seems that lowering concurrency from 12 recorders to 6 significantly lowers the probability of a hang. I have a daily intake of slightly over 4.5 GB spread across various files. I split the big ones into 50,000-line chunks, which gives me about 300-400 files to import.
With 12 recorders I had about 10 hangs per such run, maybe more. Now I get 1 or 2.
But it is still annoying, since it stops the whole batch import process.

@anonymous-matomo-user
Author

I think this is related to the API rather than the import process (or maybe the import script itself). But when the import process hangs, restarting Apache resolves the issue.

Also, whenever the import script hangs, the website becomes unavailable as well. So it seems more like an API + web issue than an issue with the import process itself.

@anonymous-matomo-user
Author

Replying to shabeepk:

I think this is related to the API rather than the import process (or maybe the import script itself). But when the import process hangs, restarting Apache resolves the issue.

Also, whenever the import script hangs, the website becomes unavailable as well. So it seems more like an API + web issue than an issue with the import process itself.

This issue seems to occur when Apache is not configured with mpm_prefork (which is recommended for PHP).

@mattab
Member

mattab commented Feb 8, 2014

Maybe check your server error logs for any relevant error?

@anonymous-matomo-user
Author

I can't confirm whether it stops the website as well. (I lowered the number of workers in my scripts, so now I only parse a few logs in parallel; with 3 workers I've had only one hang so far, but the import time is significantly longer because it is determined by the longest file.)
I'll try to raise the number of workers again and see if it starts hanging.
But I'm pretty sure I'm using Apache with the prefork MPM:

$ curl http://piwik/server-info 2>/dev/null | grep MPM | sed -s 's/<[^>]\+>//g'
Timeouts: connection: 60  keep-alive: 15
MPM Name: Prefork
MPM Information: Max Daemons: 256 Threaded: no Forked: yes
 -D APACHE_MPM_DIR="server/mpm/prefork"

@mattab
Member

mattab commented Mar 4, 2014

The MPM error message is a server config issue, not a Piwik bug. All the best finding a solution for this one!

@anonymous-matomo-user
Author

Come on. This is not an "MPM error". It's just an excerpt from the server-info page showing which MPM I'm using. It's configured correctly with the prefork MPM, and it's working OK.
Additionally, after I raised the worker limit again, my import job hung while the website remained available.
There is nothing in the server error logs.
Restarting Apache doesn't change anything.
It seems that one of the threads hangs on recvmsg() (as I showed earlier):

# strace -p 15712
Process 15712 attached - interrupt to quit
recvmsg(4,

All other threads are just waiting on futexes.

@mattab
Member

mattab commented Mar 6, 2014

What is happening, AFAIK, is that the request to your web server times out. It is not a Piwik issue, unless Piwik is crashing in some way. If you don't have anything in your error log, then it is likely a server configuration issue.

What exact command do you use?

@anonymous-matomo-user
Author

I'm not saying it's a Piwik issue on the server side. I'm thinking of it more as an issue within the log import script.

Notice that the process hangs on recvmsg(), not on select() or a similar call. That means that if the server never sends any more data, the receiving process (the import script) will hang forever. I suppose something is reading from a socket without first checking whether any data is available. There is no timeout, because the socket is still open.
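
If that is what's happening, one possible workaround (just a sketch; as far as I can tell this call is not in import_logs.py today) would be to set a global socket timeout near the top of the script, so a stalled recvmsg() eventually raises socket.timeout instead of blocking forever:

import socket

# Make every socket created afterwards (including the ones urllib2 uses
# internally) time out instead of blocking forever on a dead connection.
# 300 seconds is an arbitrary value, only for illustration.
socket.setdefaulttimeout(300)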

The actual command I use is generated by a script:

MERGER=merge_files.py
IMPORTER=import_logs.py

eval find $SQUIDS -type f -name \*-access.log-$CURRENT_DATE -printf \'%f\\n\' |
    sort | uniq | xargs -P $CONCURRENCY_LIMIT -I {} bash -c "
      logfile={}
      LOGHOSTNAME=\${logfile%-access.log-*}
      eval $MERGER $SQUIDS/\$logfile |  
      $PYTHON_CMD $IMPORTER  \
        --token-auth=$AUTH_TOKEN \
        --add-sites-new-hosts \
        --recorders=$THREAD_LIMIT \
        --recorder-max-payload=1000 \
        --url=https://piwik  \
        --log-format-regex='(?P<ip>\S+) \- \- \[(?P<date>\S+)\s?\S+?\] \"\S+ http\:\/\/(?P<unusedhost>[\w\-\.]*)(?P<path>.*?) \S+\" (?P<status>\S+) (?P<length>\S+) \"(?P<referrer>.*?)\" \"(?P<user_agent>.*?)\" \S+'  \
        --enable-http-errors \
        --enable-http-redirects \
        --log-hostname=\$LOGHOSTNAME \
        --enable-static - "

@mattab
Member

mattab commented Mar 7, 2014

Ok this may indeed be a bug. Maybe try with a payload of 100 instead of 1000 and see if the problem still occurs?

@anonymous-matomo-user
Author

OK. I'll lower the payload then and watch for a few days to see if the problem persists (it occurs very unpredictably).

@anonymous-matomo-user
Author

I lowered the payload to 100. It didn't help: I got a thread hung at recvmsg() again.
It makes you wonder, though: if the import hangs with many recorders and doesn't hang (or at least not as often) with few, is the library you're using for querying the server thread-safe? I read that urllib2 is supposed to behave OK as long as you don't do anything special with its objects, but it was not written with thread safety in mind, so it might misbehave in unpredictable ways.
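
Another thing that might be worth trying (purely a sketch of the idea, with hypothetical names; it is not how import_logs.py is currently written) is giving each recorder thread its own urllib2 opener and an explicit timeout, so the threads never share connection-handling objects:

import threading
import urllib2

_local = threading.local()

def _thread_opener():
    # One OpenerDirector per recorder thread instead of shared module state.
    if not hasattr(_local, "opener"):
        _local.opener = urllib2.build_opener()
    return _local.opener

def send_hits(url, payload):
    # Hypothetical helper: POST the bulk payload with a hard timeout so a
    # stalled response cannot block the recorder thread forever.
    request = urllib2.Request(url, payload)
    return _thread_opener().open(request, timeout=300).read()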

@mattab
Member

mattab commented Mar 8, 2014

We still cannot replicate the hanging process issue. There is likely a bug or a misconfiguration, but until we can replicate it, or you can troubleshoot what is causing it, we cannot really help yet. Maybe someone else will report the issue here, or maybe you'll find out how to reproduce it...

@anonymous-matomo-user
Author

I understand. The main problem is that I can't reproduce it myself in a deterministic way. As I said before - I can't predict whether it happens or not. It just sometimes does.

@anonymous-matomo-user
Author

I managed to run my import script under strace. It seems that the issue arises when two threads query the DNS servers at the same time. While one thread succeeds, the other hangs as I described earlier.
The strange thing is that the socket (fd 12 in this case) is closed by this very thread, yet the import script keeps using it: a new socket is returned as fd 4, but the subsequent poll/sendto/recvfrom calls still refer to fd 12, which may by then belong to another thread. (A possible workaround sketch follows the trace below.)

08:37:19.920001 open("/etc/hosts", O_RDONLY|O_CLOEXEC) = 12
08:37:19.920253 fstat(12, {st_mode=S_IFREG|0644, st_size=83, ...}) = 0
08:37:19.920480 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fbf7ff6e000
08:37:19.920661 read(12, "127.0.0.1   localhost.localdomai"..., 4096) = 83
08:37:19.921097 read(12, "", 4096) = 0
08:37:19.921548 close(12) = 0
08:37:19.922040 munmap(0x7fbf7ff6e000, 4096)      = 0
08:37:19.922591 futex(0x7fbf747bd240, FUTEX_WAKE_PRIVATE, 1)           = 0
08:37:19.923433 socket(PF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP)                = 12
08:37:19.924190 connect(12, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("8.8.8.8")}, 16) = 0
08:37:19.924692 gettimeofday({1395646639, 924719}, NULL) = 0
08:37:19.925232 poll([{fd=12, events=POLLOUT}], 1, 0)          = 1 ([{fd=12, revents=POLLOUT}])
08:37:19.925636 sendto(12, "\302\240\1\0\0\1\0\0\0\0\0\0\5piwik\2eo\2pl\0\0\1\0\1", 29, MSG_NOSIGNAL, NULL, 0) = 29
08:37:19.926164 poll([{fd=12, events=POLLIN}], 1, 5000) = 1 ([{fd=12, revents=POLLIN}])
08:37:19.927042 ioctl(12, FIONREAD, [147]) = 0
08:37:19.927876 recvfrom(12, "\302\240\205\200\0\1\0\1\0\3\0\3\5piwik\2ab\2ab\0\0\1\0\1\300\f\0"..., 2048, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("8.8.8.8")}, [16]) = 147
08:37:19.928919 close(12) = 0
08:37:19.929849 socket(PF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 4
08:37:19.931274 connect(4, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("8.8.8.8")}, 16) = 0
08:37:19.933639 gettimeofday({1395646639, 933684}, NULL) = 0
08:37:19.935177 poll([{fd=12, events=POLLOUT}], 1, 4991) = 1 ([{fd=12, revents=POLLOUT}])
08:37:19.939369 sendto(12, ",W\1\0\0\1\0\0\0\0\0\0\5piwik\2ab\2ab\0\0\34\0\1", 29, MSG_NOSIGNAL, NULL, 0) = 29
08:37:19.939872 gettimeofday({1395646639, 939872}, NULL) = 0
08:37:19.940067 poll([{fd=12, events=POLLIN}], 1, 4984) = 1 ([{fd=12, revents=POLLIN}])
08:37:19.940524 ioctl(12, FIONREAD, [0]) = -1 EOPNOTSUPP (Operation not supported)
08:37:19.941066 recvfrom(12,  
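
If concurrent DNS lookups really are racing on the same file descriptor, a defensive workaround (a sketch only; the lock and the monkey-patch are my idea, not existing import_logs.py code) would be to serialize name resolution across the recorder threads:

import socket
import threading

_dns_lock = threading.Lock()
_original_getaddrinfo = socket.getaddrinfo

def _serialized_getaddrinfo(*args, **kwargs):
    # Allow only one thread at a time to perform a DNS lookup, so two
    # resolver calls can never race on the same descriptor.
    with _dns_lock:
        return _original_getaddrinfo(*args, **kwargs)

# Patch before any recorder threads are started.
socket.getaddrinfo = _serialized_getaddrinfo

An even simpler test would be to add the piwik host to /etc/hosts on the importing machine, so no DNS query over the network is needed at all.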

anonymous-matomo-user added this to the 2.x - The Great Piwik 2.x Backlog milestone on Jul 8, 2014
mattab removed the P: low label on Aug 3, 2014
@anchalaggarwal

Hi all,
Is there a timeline for resolving this bug? By when can we expect a solution?

mattab modified the milestones: Mid term, Short term on Oct 11, 2014
@mattab
Member

mattab commented Dec 1, 2014

@anchalaggarwal unfortunately we don't know how to solve this problem. In our tests, and when we run it in production, we haven't yet (AFAIK) had this issue. Maybe you or someone else has an idea?

mattab added the Task label and removed the Bug label on Dec 1, 2014
@mattab
Member

mattab commented Mar 12, 2015

Issue was moved to the new repository for Piwik Log Analytics: https://github.com/piwik/piwik-log-analytics/issues

refs #7163

mattab closed this as completed on Mar 12, 2015
mattab added the duplicate label on Mar 12, 2015