A Python crawler for extensions from the Chrome Web Store.



A collection of utilities for downloading and analyzing browser extensions from the Chrome Web Store.

  • crawler: A crawler for extensions from the Chrome Web Store.
  • crx-tool: A tool for analyzing and extracting *.crx files (i.e., Chrome extensions). Calling crx-tool.py <extension>.crx will check the integrity of the extension.
  • crx-extract: A simple tool for extracting *.crx files from the tar-based archive hierarchy.
  • crx-jsinventory: Build a JavaScript inventory of a *.crx file using a JavaScript decomposition analysis.
  • crx-jsstrings: A tool for extracting code blocks, comment blocks, and string literals from JavaScript.
  • create-db: A tool for updating a remote MariaDB from already existing extension archives.
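As an illustration of what an integrity check on a *.crx file can involve, here is a minimal sketch. It is not the code used by crx-tool: the function names `crx_zip_offset` and `check_crx` are hypothetical. It relies only on the publicly documented CRX container layout (a `Cr24` magic number, a version field, a version-specific header, then a ZIP archive) and checks the CRC of every archive member. It does not verify the cryptographic signature.

```python
import io
import struct
import zipfile

CRX_MAGIC = b"Cr24"


def crx_zip_offset(data: bytes) -> int:
    """Return the byte offset at which the embedded ZIP archive starts."""
    if data[:4] != CRX_MAGIC:
        raise ValueError("not a CRX file: missing 'Cr24' magic")
    (version,) = struct.unpack_from("<I", data, 4)
    if version == 2:
        # CRX2: public-key length and signature length follow the version.
        pubkey_len, sig_len = struct.unpack_from("<II", data, 8)
        return 16 + pubkey_len + sig_len
    if version == 3:
        # CRX3: a single length field for the protobuf-encoded header.
        (header_len,) = struct.unpack_from("<I", data, 8)
        return 12 + header_len
    raise ValueError(f"unsupported CRX version: {version}")


def check_crx(data: bytes) -> bool:
    """Return True only if the header parses and every ZIP member passes its CRC check."""
    try:
        offset = crx_zip_offset(data)
        with zipfile.ZipFile(io.BytesIO(data[offset:])) as archive:
            # testzip() returns the name of the first corrupt member, or None.
            return archive.testzip() is None
    except (ValueError, struct.error, zipfile.BadZipFile):
        return False
```

A caller would read the file in binary mode and pass its contents, e.g. `check_crx(open("example.crx", "rb").read())`, where `example.crx` is a placeholder name.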

The utilities store the extensions in the following directory hierarchy:

   ├── conf
   │   └── forums.conf
   ├── data
   │   └── ...
   └── log
       └── ...

The crawler downloads the most recent version of an extension (i.e., the *.crx file as well as the overview page). In addition, the conf directory may contain one file, called forums.conf, that lists the IDs of extensions for which the forums and support pages should be downloaded as well. The data directory will contain the downloaded extensions.
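The README does not specify the syntax of forums.conf, so the following is only a sketch under stated assumptions: one extension ID per line, with `#` starting a comment. The helper name `read_forum_ids` is hypothetical and not part of the project. Chrome Web Store extension IDs consist of 32 characters from the range a to p, which the sketch uses to skip malformed entries.

```python
import re
from pathlib import Path

# Chrome extension IDs are 32 characters drawn from the letters a-p.
EXTENSION_ID = re.compile(r"[a-p]{32}")


def read_forum_ids(archive_root):
    """Collect extension IDs from <archive_root>/conf/forums.conf.

    Assumed file format (not confirmed by the project): one ID per line,
    blank lines allowed, '#' starts a comment.
    """
    conf = Path(archive_root) / "conf" / "forums.conf"
    if not conf.is_file():
        # The file is optional, so a missing file simply means "no IDs".
        return set()
    ids = set()
    for line in conf.read_text(encoding="utf-8").splitlines():
        entry = line.split("#", 1)[0].strip()
        if EXTENSION_ID.fullmatch(entry):
            ids.add(entry)
    return ids
```

Returning a set keeps membership tests cheap when deciding, per extension, whether the forum and support pages should be fetched as well.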

The crawler and create-db scripts will access and update a MariaDB database. They will use the host, database, and credentials found in ~/.my.cnf. Since they make use of various JSON features, it is recommended to use at least version 10.2.8 of MariaDB.
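To check which connection settings would be picked up, the option file can be inspected with Python's standard library. This is a sketch, not the project's own loading code: the function name `read_client_options` is hypothetical, and it assumes the settings live in a `[client]` section written as plain `key=value` lines (no `!include` directives, no quoting).

```python
import configparser
from pathlib import Path


def read_client_options(path=None):
    """Return the key/value options of the [client] section of a MySQL/MariaDB option file.

    Defaults to ~/.my.cnf. Options without a value (bare flags) are dropped.
    """
    path = Path(path) if path is not None else Path.home() / ".my.cnf"
    parser = configparser.ConfigParser(
        allow_no_value=True,  # option files may contain bare flags
        interpolation=None,   # passwords may contain '%'
        strict=False,         # tolerate repeated sections or options
    )
    parser.read(path, encoding="utf-8")  # a missing file is silently ignored
    if not parser.has_section("client"):
        return {}
    return {key: value for key, value in parser.items("client") if value is not None}
```

For example, `read_client_options().get("host")` would return the configured host, or `None` if the file or the option is absent.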

All utilities are written in Python 3.6. The required modules are listed in the file requirements.txt.


Clone the repository and use pip3 to install it as a package:

git clone git@logicalhacking.com:BrowserSecurity/ExtensionCrawler.git
pip3 install --user -e ExtensionCrawler



  • Mehmet Balande


This project is licensed under the GPL 3.0 (or any later version).

SPDX-License-Identifier: GPL-3.0-or-later

Master Repository

The master git repository for this project is hosted by the Software Assurance & Security Research Team at https://git.logicalhacking.com/BrowserSecurity/ExtensionCrawler.