Commit Graph

709 Commits

Author SHA1 Message Date
Michael Herzberg 58aacef3ff Reopen connection after every exception. 2017-09-16 12:31:00 +01:00
Michael Herzberg a514c0001e Added check for empty crx files. 2017-09-16 12:14:41 +01:00
Michael Herzberg b51de8577f Added compression for mysql. 2017-09-16 12:04:35 +01:00
Achim D. Brucker 92e1c4c2e5 Skip deleted files. 2017-09-16 11:41:21 +01:00
Achim D. Brucker 082cd2fc65 Added hackish pull method that uses the regular git binary. While this method will not work well with filenames containing spaces and there might be other glitches, it allows pulling an update of the cdnjs git repository (> 100GB) within a couple of minutes, compared to the couple of days that the non-hackish solution needs. 2017-09-16 11:36:40 +01:00
Achim D. Brucker 5d3343acf1 Refactoring: moved git_repo creation into pull_get_list_changed_files(...). 2017-09-16 10:33:11 +01:00
Achim D. Brucker 7b0e63da10 Implemented n/N options for external parallelisation (only for fresh initialization). 2017-09-15 22:40:46 +01:00
Michael Herzberg a1781b9ff9 Merge branch 'master' of logicalhacking.com:BrowserSecurity/ExtensionCrawler 2017-09-15 21:32:25 +01:00
Michael Herzberg 1814b1738a Added email notifications on abort. 2017-09-15 21:32:12 +01:00
Achim D. Brucker 400e74ae3f Merge branch 'master' of logicalhacking.com:BrowserSecurity/ExtensionCrawler 2017-09-15 20:21:45 +01:00
Achim D. Brucker 26678636eb Ignore commits where blobs are None. 2017-09-15 20:21:05 +01:00
Michael Herzberg 85680d360b Automatically reopen database connection on failure. 2017-09-15 18:23:25 +01:00
Michael Herzberg ddbbc2672d Try to insert also other data if some inserts fail. Use autocommit to prevent data loss on retries. 2017-09-15 18:15:03 +01:00
Michael Herzberg c57bce2491 Merge branch 'master' of logicalhacking.com:BrowserSecurity/ExtensionCrawler 2017-09-15 17:42:05 +01:00
Achim D. Brucker 936f2d3189 Log git info before starting pull (update). 2017-09-14 22:54:37 +01:00
Achim D. Brucker 2ff30f7382 Parallel execution of git date queries. 2017-09-14 15:11:53 +01:00
Achim D. Brucker 12a1e282aa The method pull_get_updated_lib_files(...) now also returns unique library/version information. 2017-09-14 10:44:30 +01:00
Achim D. Brucker e3f1202e44 Use version dictionary. 2017-09-14 10:33:00 +01:00
Achim D. Brucker f54f29c9ba Added build_release_date_dic(...). 2017-09-14 09:50:09 +01:00
Achim D. Brucker 3b217922c5 Added line count. 2017-09-13 16:41:01 +01:00
Achim D. Brucker 420eec7462 Minor memory optimizations. 2017-09-13 11:12:33 +01:00
Achim D. Brucker ec1c47625a Added support for parallel update of database. 2017-09-13 09:13:35 +01:00
Achim D. Brucker c386bd01dd Added missing string conversion. 2017-09-13 08:29:23 +01:00
Achim D. Brucker 42e685ee32 Added missing string conversion. 2017-09-13 08:01:02 +01:00
Achim D. Brucker 18fb23d3dc Use glob instead of os.walk() to avoid memory leak in the latter. 2017-09-13 04:04:38 +01:00
Achim D. Brucker 76d5993794 Added logging output. 2017-09-13 03:02:39 +01:00
Achim D. Brucker c30f7fdd7c Implemented skeleton of main routine. 2017-09-13 02:56:13 +01:00
Achim D. Brucker a8a5534be1 Renamed module. 2017-09-13 01:13:17 +01:00
Achim D. Brucker bdb84c2120 Renamed module. 2017-09-13 01:09:30 +01:00
Achim D. Brucker 4e5b52617f Catch exception during decompression and increase max. allowed size of decompressed data to 100 times of compressed size. 2017-09-13 00:23:17 +01:00
Achim D. Brucker 88efe2b8a4 Reformatting. 2017-09-13 00:02:20 +01:00
Achim D. Brucker ea9339bc53 Compute data identifiers for uncompressed content of gzip compressed files. 2017-09-13 00:01:15 +01:00
Achim D. Brucker f9cf7bd35f Refactoring: moved computation of data related identifiers into own method. 2017-09-12 23:52:52 +01:00
Achim D. Brucker 8243664974 Use StringIO representation for normalizing js/css files (avoid re-reading the file content from disk). 2017-09-12 23:43:09 +01:00
Achim D. Brucker 933c4d4d11 Determine file description from buffer instead of from file (avoid reading file twice). 2017-09-12 23:23:22 +01:00
Michael Herzberg 5ce3f2a148 Added until-date option. 2017-09-12 11:01:44 +01:00
Achim D. Brucker 6353202ee8 Renaming: fileinfo -> filedb. 2017-09-10 22:59:07 +01:00
Achim D. Brucker 0426d7d3d1 Reformatting. 2017-09-10 22:39:47 +01:00
Achim D. Brucker e5da9abaea Added get_file_libinfo(...). 2017-09-10 22:38:49 +01:00
Achim D. Brucker 8d9f6e4fa1 Merge branch 'master' of logicalhacking.com:BrowserSecurity/ExtensionCrawler 2017-09-10 17:40:45 +01:00
Achim D. Brucker ad2af517a3 Aggressively try to normalize as many filetypes as possible. 2017-09-10 17:40:30 +01:00
Achim D. Brucker 06ff5f3057 Method for computing basic file identifiers. 2017-09-10 15:57:07 +01:00
Achim D. Brucker a6e90794bc Extended const_basedir to check environment variable EXTENSION_ARCHIVE and modified main scripts to actually use const_basedir. 2017-09-10 15:55:22 +01:00
Achim D. Brucker 4b31097975 Added function for computing a list of normalized code blocks for a JavaScript file. 2017-09-10 15:02:57 +01:00
Michael Herzberg fbef566466 Merge branch 'master' of logicalhacking.com:BrowserSecurity/ExtensionCrawler 2017-09-10 12:20:33 +01:00
Michael Herzberg e09cb16083 Updated path to archive. 2017-09-10 12:20:23 +01:00
Achim D. Brucker 52b42dfaef Changed pull method to return list of changed files. 2017-09-10 11:01:29 +01:00
Achim D. Brucker c3053427c0 Added method for obtaining initial commit date and pulling git repos. 2017-09-09 23:13:26 +01:00
Achim D. Brucker 08b70ed63a Updated archive dir to reflect new file hierarchy by default. 2017-09-08 21:10:40 +01:00
Achim D. Brucker a519495096 Removed outdated sync script (only useful for old sqlite-based setup). 2017-09-08 20:58:36 +01:00
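
Commit 4e5b52617f describes catching exceptions during decompression and capping the decompressed size at 100 times the compressed size. The listing does not show the code itself, so the following is only a minimal sketch of that idea using Python's standard `zlib` module. The function name `bounded_gunzip` and the choice to return `None` on failure are assumptions made for illustration, not the project's actual implementation.

```python
import gzip
import zlib

# Ratio taken from the commit message: decompressed data may be at most
# 100 times the size of the compressed input.
MAX_RATIO = 100


def bounded_gunzip(compressed, max_ratio=MAX_RATIO):
    """Hypothetical helper: return the decompressed bytes, or None if the
    input is corrupt, truncated, or expands beyond the allowed ratio."""
    limit = max_ratio * len(compressed)
    # wbits=31 (16 + 15) tells zlib to expect a gzip header and trailer.
    decomp = zlib.decompressobj(31)
    try:
        # Ask for at most limit + 1 bytes so an oversized stream is
        # detected without inflating all of it into memory.
        data = decomp.decompress(compressed, limit + 1)
    except zlib.error:
        return None  # not valid gzip data
    if len(data) > limit:
        return None  # expands too much; treat as suspicious
    if not decomp.eof:
        return None  # stream ended early (truncated input)
    return data
```

For example, `bounded_gunzip(gzip.compress(b"abc" * 50))` gives back the original bytes, while a payload of ten million zero bytes compressed with gzip is rejected because it expands far beyond the 100x limit.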