Commit Graph

503 Commits

Author SHA1 Message Date
Achim D. Brucker 74da2e9c08 Initial simhash integration. 2017-11-19 00:36:15 +00:00
Achim D. Brucker acfdb9ee50 Removed unused function analyse_comment_blocks. 2017-11-18 23:21:19 +00:00
Achim D. Brucker e3519f012d Reformatting. 2017-11-17 16:58:48 +00:00
Achim D. Brucker 32c08672d9 Added log output for failed data decoding. 2017-11-16 07:13:55 +00:00
Achim D. Brucker 3db3435c07 Refactoring of heuristic detection stubs. 2017-11-15 08:05:40 +00:00
Achim D. Brucker c5dce7bcd0 Fixed decoding of content (str_data). 2017-11-15 07:12:41 +00:00
Achim D. Brucker 91e6014c6c Moved to single-threaded mode. 2017-11-12 14:07:25 +00:00
Achim D. Brucker 4cb49f2281 Merge branch 'production' 2017-11-11 21:56:33 +00:00
Achim D. Brucker 9bd283f35a Fixed use of append. 2017-11-10 00:13:06 +00:00
Achim D. Brucker 7dfbdac670 Disabled parallel updates (for debugging a deadlock situation). 2017-11-09 23:38:05 +00:00
Achim D. Brucker 5cc7a92f90 Fixed typo. 2017-11-09 00:17:09 +00:00
Achim D. Brucker ac910bf819 Updated python version to 3.6. 2017-11-07 20:58:24 +00:00
Achim D. Brucker 631f461d1f Removed not supported connection_timeout parameter. 2017-11-06 06:11:14 +00:00
Achim D. Brucker 6279bd9909 Fixed syntax error. 2017-11-05 20:14:12 +00:00
Achim D. Brucker 15079496cc Merge branch 'master' of logicalhacking.com:BrowserSecurity/ExtensionCrawler 2017-11-05 00:07:23 +00:00
Achim D. Brucker fcab770233 Reformatting. 2017-11-05 00:07:04 +00:00
Achim D. Brucker 07a7b346c7 Merge branch 'master' of logicalhacking.com:BrowserSecurity/ExtensionCrawler 2017-11-04 23:14:27 +00:00
Achim D. Brucker 7ba829c90f Made python 3.6 the default. 2017-11-02 18:46:20 +00:00
Achim D. Brucker cfc26e62d7 Free git object as early as possible. 2017-10-22 21:20:49 +01:00
Achim D. Brucker 0963ea59d3 Fixed typo. 2017-10-21 20:12:19 +01:00
Achim D. Brucker d88e73167d Explicitly free git_obj. 2017-10-20 18:59:53 +01:00
Achim D. Brucker e4a8075da9 Configure timeout and retries for database connection. 2017-10-18 20:19:43 +01:00
Achim D. Brucker 9f5d8f9b9e Added logs during creation of db connection. 2017-10-18 08:35:27 +01:00
Achim D. Brucker 14da483046 Even more logging. 2017-10-17 15:17:29 +01:00
Achim D. Brucker 37ebd510c9 Reformatting. 2017-10-16 09:47:14 +01:00
Achim D. Brucker 4ee9c51ef7 Reformatting. 2017-10-16 09:42:43 +01:00
Achim D. Brucker fc33abb7a6 Fixed logging. 2017-10-16 09:35:53 +01:00
Achim D. Brucker 8780eb8f2f Added further logging output (info). 2017-10-16 05:36:59 +01:00
Achim D. Brucker bbfbbed35a Identify resource/media files using the file library. 2017-10-15 15:34:45 +01:00
Michael Herzberg afe137ba36 Integrated last_crx_etag into last_crx. 2017-10-14 19:59:46 +01:00
Achim D. Brucker 64bc9bd90d Make use of database with md5 sums optional. 2017-10-14 19:17:37 +01:00
Michael Herzberg ea800da613 Create new thread after 100 extensions. 2017-10-13 15:54:29 +01:00
Achim D. Brucker 03b08db905 Bug fix: download all extensions in parallel mode. 2017-10-13 10:35:51 +01:00
Michael Herzberg f51bcfbf46 Use con object from db.py. 2017-10-12 16:01:45 +01:00
Achim D. Brucker d3b7dea4d8 Added detection based on file sizes after stripping white spaces. 2017-10-11 20:18:15 +01:00
Achim D. Brucker 10a80e2861 Compute size after stripping. 2017-10-11 20:16:33 +01:00
Achim D. Brucker 39490ca490 Enforce block type to be code if it is not a comment. 2017-10-11 10:20:09 +01:00
Achim D. Brucker 91e0180151 Fixed indentation. 2017-10-11 09:46:39 +01:00
Achim D. Brucker a077d7e8b2 Fixed typo. 2017-10-11 09:43:54 +01:00
Achim D. Brucker dbdaa772dc Fixed typo. 2017-10-11 09:41:51 +01:00
Achim D. Brucker a4926aed19 Only store relative path for library files. 2017-10-11 09:22:27 +01:00
Achim D. Brucker 8dd745f826 Classify normalized detection as 'very likely library'. 2017-10-11 09:14:22 +01:00
Achim D. Brucker ee7ce8b446 Report stored library filename of detected libraries. 2017-10-11 08:48:20 +01:00
Achim D. Brucker 8c43fadfdb Basic implementation: check_md5_normalized(...). 2017-10-11 00:48:04 +01:00
Achim D. Brucker 154118cf50 Basic implementation: check_md5_decompressed(...). 2017-10-11 00:44:15 +01:00
Achim D. Brucker c6e5cb8511 Basic implementation: md5 checksum based library detection. 2017-10-11 00:40:06 +01:00
Achim D. Brucker 518372c6f2 Fixed library/version computation for sub-tasks. 2017-10-10 23:02:21 +01:00
Achim D. Brucker 61010a6a01 Bug fix: library identification for multi-task jobs. 2017-10-10 22:16:46 +01:00
Michael Herzberg 63ae8ac4a7 Added missing fields for cdnjs and introduced new crxfile and libdet tables. 2017-10-10 18:55:28 +01:00
Michael Herzberg 6632cd0ded Added database update for cdnjs. 2017-10-10 15:35:02 +01:00
Michael Herzberg 048990e8f8 Turned dbbackend into a package. 2017-10-10 15:10:41 +01:00
Michael Herzberg 301ad23d4c Use new review etc. table structure. 2017-10-09 17:18:01 +01:00
Michael Herzberg 2b1e55c7ec Fixed import. 2017-10-09 13:56:22 +01:00
Michael Herzberg 300a8c905a Only log last mysql exception as error, rest as warning. 2017-10-08 20:57:25 +01:00
Achim D. Brucker 25c37d83c1 Silently correct 'name use count' exception from libmagic (caused by a bug in the magic Python module). 2017-10-08 15:18:58 +01:00
Achim D. Brucker 1963a20b69 Report starting positions of string literals. 2017-10-08 12:03:50 +01:00
Michael Herzberg 615b8f46a3 Fixed mysql caching. 2017-10-07 21:01:14 +01:00
Michael Herzberg 2abc386f48 Merge branch 'master' of logicalhacking.com:BrowserSecurity/ExtensionCrawler 2017-10-06 20:13:16 +01:00
Michael Herzberg 6372c62336 Removed sorting again. 2017-10-06 20:13:08 +01:00
Achim D. Brucker 1ee76d9817 Merge branch 'master' of logicalhacking.com:BrowserSecurity/ExtensionCrawler 2017-10-06 19:36:12 +01:00
Achim D. Brucker c1750838f1 Added support for tar files. 2017-10-06 18:33:35 +01:00
Michael Herzberg d6869455a8 Sort extension ids before processing. 2017-10-06 12:12:49 +01:00
Michael Herzberg d05194b9bb Group cached commits for efficiency. 2017-10-06 12:08:21 +01:00
Michael Herzberg 2cb56edd9b Adjusted retries for create-db. 2017-10-05 11:14:59 +01:00
Michael Herzberg 6ba73c2ed9 Changed autocommit behaviour. 2017-10-04 20:56:47 +01:00
Achim D. Brucker e63a13ae09 Bug fix: decompression. 2017-09-22 08:42:02 +01:00
Achim D. Brucker e4245ed1dd Reformatting. 2017-09-20 10:03:14 +01:00
Achim D. Brucker a63dd53e45 Refactoring. 2017-09-20 10:02:02 +01:00
Achim D. Brucker 0cb0a4226d Added option for passing a list with libs to update. 2017-09-20 07:57:14 +01:00
Michael Herzberg 4712e15249 Fixed autocommit bug. 2017-09-19 17:09:35 +01:00
Achim D. Brucker 50a7ba8a91 Minor refactoring. 2017-09-19 10:02:46 +01:00
Achim D. Brucker 4f84c5626d Minor refactoring. 2017-09-19 09:16:32 +01:00
Achim D. Brucker 061622f588 Refactoring: stub of new main analysis method. 2017-09-18 09:09:00 +01:00
Achim D. Brucker aadbc5aa0c Refactoring: removed unused variables. 2017-09-18 00:35:35 +01:00
Achim D. Brucker 50b91d3a35 Renaming jsFilename -> filename. 2017-09-18 00:30:55 +01:00
Michael Herzberg 175ebd53b7 Merge branch 'master' of logicalhacking.com:BrowserSecurity/ExtensionCrawler 2017-09-17 17:45:17 +01:00
Michael Herzberg 7277e6f76e Fixed log msg bug. 2017-09-17 17:45:01 +01:00
Michael Herzberg 0cb7d6e792 Fixed error in exception handling. 2017-09-17 17:40:48 +01:00
Achim D. Brucker 3626b9fb76 Ordered and extended enumeration DetectionType. Order reflects reliability of checks. 2017-09-17 13:40:38 +01:00
Achim D. Brucker a3346cb95e Use file_identfiers module to compute file identifiers. 2017-09-17 13:18:49 +01:00
Achim D. Brucker 6d69377f28 Introduced optional parameter data to compute identifiers without opening a file handle. 2017-09-17 13:18:20 +01:00
Michael Herzberg 1fab393e56 Merge branch 'master' of logicalhacking.com:BrowserSecurity/ExtensionCrawler 2017-09-16 17:23:16 +01:00
Michael Herzberg c3e295267b Log loglevel and only print stacktrace on first mysql exception. 2017-09-16 17:22:57 +01:00
Achim D. Brucker 205c8836e9 Bug fix: do not catch exceptions too aggressively and fix libvers computation for updates. 2017-09-16 17:20:23 +01:00
Achim D. Brucker 4cf41e2e4f Refactoring: moved generic file identifiers into own module. 2017-09-16 17:19:36 +01:00
Achim D. Brucker e98f58fff8 Merge branch 'master' of logicalhacking.com:BrowserSecurity/ExtensionCrawler 2017-09-16 13:41:56 +01:00
Achim D. Brucker 24c65daecf Bug fix: check for dirty missed actual function application. 2017-09-16 13:41:47 +01:00
Achim D. Brucker c274b96f66 Added csv output for debugging. 2017-09-16 13:21:49 +01:00
Michael Herzberg 69e95fdf13 Catch json parse exceptions for reviews etc. more nicely. 2017-09-16 12:53:35 +01:00
Michael Herzberg 58aacef3ff Reopen connection after every exception. 2017-09-16 12:31:00 +01:00
Michael Herzberg a514c0001e Added check for empty crx files. 2017-09-16 12:14:41 +01:00
Michael Herzberg b51de8577f Added compression for mysql. 2017-09-16 12:04:35 +01:00
Achim D. Brucker 92e1c4c2e5 Skip deleted files. 2017-09-16 11:41:21 +01:00
Achim D. Brucker 082cd2fc65 Added hackish pull method that uses the regular git binary. While this method will not work well with filenames containing spaces and there might be other glitches, it allows pulling an update of the cdnjs git repository (> 100GB) within a couple of minutes, compared to the couple of days that the non-hackish solution needs. 2017-09-16 11:36:40 +01:00
Achim D. Brucker 5d3343acf1 Refactoring: moved git_repo creation into pull_get_list_changed_files(...). 2017-09-16 10:33:11 +01:00
Achim D. Brucker 7b0e63da10 Implemented n/N options for external parallelisation (only for fresh initialization). 2017-09-15 22:40:46 +01:00
Achim D. Brucker 400e74ae3f Merge branch 'master' of logicalhacking.com:BrowserSecurity/ExtensionCrawler 2017-09-15 20:21:45 +01:00
Achim D. Brucker 26678636eb Ignore commits where blobs are None. 2017-09-15 20:21:05 +01:00
Michael Herzberg 85680d360b Automatically reopen database connection on failure. 2017-09-15 18:23:25 +01:00
Michael Herzberg ddbbc2672d Try to insert also other data if some inserts fail. Use autocommit to prevent data loss on retries. 2017-09-15 18:15:03 +01:00
Achim D. Brucker 936f2d3189 Log git info before starting pull (update). 2017-09-14 22:54:37 +01:00
Achim D. Brucker 2ff30f7382 Parallel execution of git date queries. 2017-09-14 15:11:53 +01:00
Achim D. Brucker 12a1e282aa The method pull_get_updated_lib_files(...) now also returns unique library/version information. 2017-09-14 10:44:30 +01:00
Achim D. Brucker e3f1202e44 Use version dictionary. 2017-09-14 10:33:00 +01:00
Achim D. Brucker f54f29c9ba Added build_release_date_dic(...). 2017-09-14 09:50:09 +01:00
Achim D. Brucker 3b217922c5 Added line count. 2017-09-13 16:41:01 +01:00
Achim D. Brucker 420eec7462 Minor memory optimizations. 2017-09-13 11:12:33 +01:00
Achim D. Brucker ec1c47625a Added support for parallel update of database. 2017-09-13 09:13:35 +01:00
Achim D. Brucker c386bd01dd Added missing string conversion. 2017-09-13 08:29:23 +01:00
Achim D. Brucker 42e685ee32 Added missing string conversion. 2017-09-13 08:01:02 +01:00
Achim D. Brucker 18fb23d3dc Use glob instead of os.walk() to avoid memory leak in the latter. 2017-09-13 04:04:38 +01:00
Achim D. Brucker 76d5993794 Added logging output. 2017-09-13 03:02:39 +01:00
Achim D. Brucker c30f7fdd7c Implemented skeleton of main routine. 2017-09-13 02:56:13 +01:00
Achim D. Brucker a8a5534be1 Renamed module. 2017-09-13 01:13:17 +01:00
Achim D. Brucker bdb84c2120 Renamed module. 2017-09-13 01:09:30 +01:00
Achim D. Brucker 4e5b52617f Catch exception during decompression and increase max. allowed size of decompressed data to 100 times the compressed size. 2017-09-13 00:23:17 +01:00
Achim D. Brucker 88efe2b8a4 Reformatting. 2017-09-13 00:02:20 +01:00
Achim D. Brucker ea9339bc53 Compute data identifiers for uncompressed content of gzip compressed files. 2017-09-13 00:01:15 +01:00
Achim D. Brucker f9cf7bd35f Refactoring: moved computation of data related identifiers into own method. 2017-09-12 23:52:52 +01:00
Achim D. Brucker 8243664974 Use StringIO representation for normalizing js/css files (avoid re-reading the file content from disk). 2017-09-12 23:43:09 +01:00
Achim D. Brucker 933c4d4d11 Determine file description from buffer instead of from file (avoid reading file twice). 2017-09-12 23:23:22 +01:00
Achim D. Brucker 6353202ee8 Renaming: fileinfo -> filedb. 2017-09-10 22:59:07 +01:00
Achim D. Brucker 0426d7d3d1 Reformatting. 2017-09-10 22:39:47 +01:00
Achim D. Brucker e5da9abaea Added get_file_libinfo(...). 2017-09-10 22:38:49 +01:00
Achim D. Brucker ad2af517a3 Aggressively try to normalize as many filetypes as possible. 2017-09-10 17:40:30 +01:00
Achim D. Brucker 06ff5f3057 Method for computing basic file identifiers. 2017-09-10 15:57:07 +01:00
Achim D. Brucker a6e90794bc Extended const_basedir to check environment variable EXTENSION_ARCHIVE and modified main scripts to actually use const_basedir. 2017-09-10 15:55:22 +01:00
Achim D. Brucker 4b31097975 Added function for computing a list of normalized code blocks for a JavaScript file. 2017-09-10 15:02:57 +01:00
Achim D. Brucker 52b42dfaef Changed pull method to return list of changed files. 2017-09-10 11:01:29 +01:00
Achim D. Brucker c3053427c0 Added method for obtaining initial commit date and pulling git repos. 2017-09-09 23:13:26 +01:00
Achim D. Brucker 8c33558934 Reformatting. 2017-09-07 20:09:29 +01:00
Achim D. Brucker 3b2913616b Skip first_seen if not defined. 2017-09-05 10:15:48 +01:00
Michael Herzberg a9173345e8 Merge branch 'master' of logicalhacking.com:BrowserSecurity/ExtensionCrawler 2017-09-04 15:54:38 +01:00
Michael Herzberg 36d36facfe Relaxed mysql retries. 2017-09-04 15:54:28 +01:00
Achim D. Brucker 6395d98443 Relaxed handling of network errors. 2017-09-04 09:11:27 +01:00
Achim D. Brucker cfeb29d95f Clean-up of logging infrastructure. 2017-09-03 15:56:27 +01:00
Achim D. Brucker f42f8e3d03 Improved error handling for request failures. 2017-09-03 15:43:33 +01:00
Achim D. Brucker 872346fa61 Add timeout parameter to http get requests. 2017-09-03 12:03:51 +01:00
Achim D. Brucker 0b0268e320 Copy outphased date to hash map of files archive. 2017-09-03 11:13:27 +01:00
Achim D. Brucker 0f716e98da Bug fix: only try to preserve outphased library information if there is any stored locally. 2017-09-03 11:09:39 +01:00
Achim D. Brucker 80c8e7caa0 Preserve outphased library versions. 2017-09-03 11:00:05 +01:00
Achim D. Brucker 03504ff81a Improved error handling. 2017-09-03 10:45:56 +01:00
Achim D. Brucker 13191f1ce0 Renaming: date -> first_seen. 2017-09-03 10:32:45 +01:00
Achim D. Brucker 59f9b47a81 Switched to Logging framework. 2017-09-03 10:29:57 +01:00
Achim D. Brucker 074447064c Enabled parallel download. 2017-09-03 10:06:55 +01:00
Achim D. Brucker 515a462938 Added methods for generating/updating index files based on the file hash. 2017-09-02 22:10:43 +01:00
Achim D. Brucker 9ae5905973 Generalized hash map builders. 2017-09-02 21:53:58 +01:00
Achim D. Brucker 22c3a7581d Reformatting. 2017-09-02 21:44:20 +01:00
Achim D. Brucker 3097db3790 Added methods for generating sha1 indexed dictionary. 2017-09-02 21:40:44 +01:00
Achim D. Brucker e5c2372222 Improved log output (verbose mode). 2017-09-02 20:57:01 +01:00
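
Commit 4e5b52617f above describes catching decompression errors and capping decompressed data at 100 times the compressed size. The repository's actual code is not shown on this page, so the following is only a minimal sketch of how such a bound might be enforced for gzip data using Python's standard zlib module. The function name bounded_gunzip and the constant MAX_RATIO are hypothetical and chosen purely for illustration.

```python
import gzip
import zlib

# Ratio taken from the commit message; the constant name is hypothetical.
MAX_RATIO = 100


def bounded_gunzip(data, max_ratio=MAX_RATIO):
    """Decompress gzip data, or return None if the data is invalid,
    truncated, or would expand beyond max_ratio times its compressed size.

    Illustrative sketch only, not the project's implementation.
    """
    limit = max_ratio * len(data)
    # Adding 16 to the window bits tells zlib to expect a gzip header.
    decomp = zlib.decompressobj(zlib.MAX_WBITS | 16)
    try:
        # Ask for at most one byte more than the limit, so an oversized
        # stream is detected without being fully expanded in memory.
        out = decomp.decompress(data, limit + 1)
    except zlib.error:
        return None
    if len(out) > limit or not decomp.eof:
        return None
    return out


if __name__ == "__main__":
    print(bounded_gunzip(gzip.compress(b"hello world")))
    print(bounded_gunzip(gzip.compress(b"\0" * 10_000_000)))
    print(bounded_gunzip(b"not gzip data"))
```

Passing a maximum length to decompress keeps memory use bounded even for highly compressible input, which is the usual reason for imposing a ratio limit of this kind.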