ExtensionCrawler

A collection of utilities for downloading and analyzing browser extensions from the Chrome Web Store.

  • crawler: A crawler for extensions from the Chrome Web Store.
  • crx-tool: A tool for analyzing and extracting *.crx files (i.e., Chrome extensions). Calling crx-tool.py <extension>.crx will check the integrity of the extension.
  • crx-extract: A simple tool for extracting *.crx files from the tar-based archive hierarchy.
  • crx-jsinventory: Build a JavaScript inventory of a *.crx file using a JavaScript decomposition analysis.
  • crx-jsstrings: A tool for extracting code blocks, comment blocks, and string literals from JavaScript.
  • create-db: A tool for updating a remote MariaDB from already existing extension archives.

The utilities store the extensions in the following directory hierarchy:

   archive
   ├── conf
   │   └── forums.conf
   ├── data
   │   └── ...
   └── log
       └── ...
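The skeleton above can be created with a few lines of Python. This is an illustrative sketch based on the layout shown in the tree, not a part of ExtensionCrawler itself (the helper name and base path are assumptions):

```python
# Sketch: create the empty archive hierarchy described above.
# The conf/, data/, and log/ subdirectories come from the README;
# init_archive() is a hypothetical helper, not an ExtensionCrawler API.
from pathlib import Path

def init_archive(base):
    """Create the archive skeleton (conf/, data/, log/) under `base`."""
    base = Path(base)
    for sub in ("conf", "data", "log"):
        (base / sub).mkdir(parents=True, exist_ok=True)
    return base

# Example: init_archive("archive") creates archive/{conf,data,log}.
```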

The crawler downloads the most recent version of each extension (i.e., the *.crx file as well as the overview page). In addition, the conf directory may contain a file called forums.conf that lists the ids of extensions for which the forum and support pages should be downloaded as well. The data directory will contain the downloaded extensions.
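The README does not specify the exact format of forums.conf. A plausible layout, assuming one Chrome Web Store extension id (32 lowercase letters) per line, with the ids below being made-up placeholders:

```
# forums.conf — hypothetical example: one extension id per line
aaaabbbbccccddddeeeeffffgggghhhh
iiiijjjjkkkkllllmmmmnnnnoooopppp
```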

The crawler and create-db scripts access and update a MariaDB instance, using the host, database, and credentials found in ~/.my.cnf. Since they make use of various JSON features, MariaDB 10.2.8 or later is recommended.
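A minimal ~/.my.cnf might look as follows. The option names follow the standard MySQL/MariaDB option-file format; the host, user, password, and database values are placeholders you must replace with your own:

```
[client]
host     = db.example.com
user     = crawler
password = changeme
database = extensions
```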

All utilities are written in Python 3.x. The required modules are listed in the file requirements.txt.

Installation

Clone the repository and use pip3 to install it as a package:

git clone git@logicalhacking.com:BrowserSecurity/ExtensionCrawler.git
pip3 install --user -e ExtensionCrawler

Team

Contributors

  • Mehmet Balande

License

This project is licensed under the GPL 3.0 (or any later version).