Commit Graph

947 Commits

Author SHA1 Message Date
Michael Herzberg dbeba9e9bf Use a lock to mix forum downloads into the parallel mode. 2018-04-21 13:59:33 +01:00
Michael Herzberg aee916a629 Moved setting of forkserver further outwards... 2018-04-15 16:26:26 +01:00
Michael Herzberg ff78f8e7d8 Fixed missing parameter. 2018-04-12 23:25:31 +01:00
Michael Herzberg a758134c97 Readded mimetype from mimetypes. TODO: add mysql columns 2018-04-11 16:52:22 +01:00
Michael Herzberg 87b2847c6e Make ProcessPool and pystuck the default (for now). 2018-04-11 15:39:23 +01:00
Michael Herzberg cd09e2509d Removed retry of worker exceptions; instead, properly log them similarly to tar and sql exceptions. 2018-04-11 15:38:32 +01:00
Michael Herzberg 22dc8f8263 Added --pystuck option to start pystuck servers for all processes. 2018-04-11 15:15:52 +01:00
Michael Herzberg 46494ec18b Re-setup logging in new processes. 2018-04-10 18:19:12 +01:00
Michael Herzberg 410fa3cf1c Moved setting of forkserver to prevent multiple invocations. 2018-04-10 17:24:10 +01:00
Michael Herzberg 12bdc1b00f Don't crash if something is wrong with the etag file. 2018-04-10 16:32:12 +01:00
Michael Herzberg 385003771a Set chunksize, maxtasksperchild, and max_tasks to 100. 2018-04-10 16:23:22 +01:00
Michael Herzberg bbe575d07b Pebble: start processing results right away. 2018-04-10 16:15:33 +01:00
Michael Herzberg 6bee81b711 Use forkserver. 2018-04-10 16:13:31 +01:00
Michael Herzberg 778736e2d3 Fixed logging of if-modified-since. 2018-04-10 10:55:03 +01:00
Michael Herzberg f677258f83 Added use of garbage collector. 2018-04-10 10:51:33 +01:00
Michael Herzberg d27106d7a9 Added creation of separate .etag files outside the .tar file. 2018-04-09 19:42:41 +01:00
Michael Herzberg 50b598993f Bugfix: actually download forums on sequential run. 2018-04-09 18:38:51 +01:00
Michael Herzberg f4c0ff56ff Use magic for mimetypes and don't attempt text-based analyses on binary resources. 2018-04-09 14:25:47 +01:00
Michael Herzberg fcfa58fb3d Wheel needs to be installed before ExtensionCrawler. 2018-04-09 00:14:07 +01:00
Achim D. Brucker 0c70b2e20b Increase number of parallel downloads. 2018-04-08 22:45:56 +01:00
Michael Herzberg 3d136daae3 Various small bug fixes. 2018-04-08 17:44:59 +01:00
Michael Herzberg faa2214af4 Timeout must be an integer. 2018-04-08 13:10:26 +01:00
Achim D. Brucker 33898a4cf3 Updated help text. 2018-04-08 10:10:30 +01:00
Achim D. Brucker e1ef0758f7 Made the choice of Pool vs. ProcessPool a configuration option. 2018-04-08 10:06:26 +01:00
Achim D. Brucker 70b64616e1 Ensure the use of /usr/bin/mail. 2018-04-08 09:59:03 +01:00
Achim D. Brucker 7f71a40ff4 Configured number of parallel processes. 2018-04-07 21:14:36 +01:00
Achim D. Brucker 66023b6b72 Reverted test of ThreadPools. 2018-04-07 21:13:32 +01:00
Achim D. Brucker a75380b0c5 Merge branch 'production' of logicalhacking.com:BrowserSecurity/ExtensionCrawler into production 2018-04-07 19:49:03 +01:00
Achim D. Brucker 987236958e Testing ThreadPools. 2018-04-07 19:48:45 +01:00
Achim D. Brucker c3d8de9b81 Testing ThreadPools. 2018-04-07 19:37:55 +01:00
Achim D. Brucker a3c60c0ae8 Ensure that mail recipient is defined. 2018-04-07 17:54:15 +01:00
Achim D. Brucker a7f0b26ead Log memory usage. 2018-04-07 16:26:00 +01:00
Achim D. Brucker 2fc154d643 Use UTC-based time/dates for logging. 2018-04-07 15:54:39 +01:00
Achim D. Brucker 91a76091e3 Use UTC-based time/dates for logging. 2018-04-07 15:42:29 +01:00
Achim D. Brucker c7d28d2c9e Merge branch 'production' of logicalhacking.com:BrowserSecurity/ExtensionCrawler into production 2018-04-07 13:29:48 +01:00
Achim D. Brucker f6a9d49da1 Reverted processing in chunks back into processing only one large list. 2018-04-07 13:17:33 +01:00
Michael Herzberg 558bff402a Removed --writable flag from read-only ExtensionCrawler image. 2018-04-07 00:42:39 +01:00
Michael Herzberg 0c3423dcd8 Fitted db connection log messages into our logging framework. 2018-04-07 00:42:39 +01:00
Michael Herzberg 9c1d48fcbe Added 'wheel' to dependencies to fix build error with simhash. 2018-04-07 00:42:39 +01:00
Achim D. Brucker 7756ad2963 Bug fix: actually use max_workers. 2018-04-06 23:04:01 +01:00
Achim D. Brucker 6a86b37e7c Increase number of parallel downloads. 2018-04-06 21:36:11 +01:00
Achim D. Brucker 14a30a570d Process extensions in chunks. 2018-04-06 21:34:09 +01:00
Achim D. Brucker d5df43c5c3 Moved heuristic for parallel download into separate method. 2018-04-06 20:32:24 +01:00
Achim D. Brucker 9434df1b28 Set max task to 100. 2018-04-06 16:37:49 +01:00
Achim D. Brucker 69f1618db2 Reduced number of parallel downloads, as pebble seems to be much more memory hungry ... 2018-04-06 13:34:36 +01:00
Achim D. Brucker d3fe5e758a New default download timeout to 2 hours. 2018-04-06 12:08:02 +01:00
Achim D. Brucker d9fc65a089 Reformatting. 2018-04-06 07:27:57 +01:00
Achim D. Brucker 8c9aab8216 Converted timeout into a proper configuration parameter. 2018-04-06 07:25:21 +01:00
Achim D. Brucker 9586eed280 Added documentation. 2018-04-06 07:18:15 +01:00
Achim D. Brucker fd9cc1855a Improved command line interface for selecting which type of extensions should be crawled. 2018-04-06 07:17:20 +01:00