A Python crawler for extensions from the Chrome Web Store.
Achim D. Brucker d491ff2950 Increased yscale. 3 years ago
ExtensionCrawler Renamed extension.developer to extension.offeredby and introduced actual extension.developer (capturing the information in the developer DIV such as the privacy policy). 3 years ago
PermissionAnalysis Don't crash on invalid categories. 4 years ago
analysis/library-detector Improved plotting for angular. 4 years ago
database Merge branch 'master' into production 3 years ago
resources Updated regexps. 6 years ago
scripts Increased yscale. 3 years ago
sge Added option to handle more than one extension per sharc job. 4 years ago
.gitignore Added pycharm folders to gitignore. 4 years ago
LICENSE initial commit 7 years ago
README.md Updated python version. 4 years ago
cdnjs-git-miner Update to Python 3.7. 4 years ago
crawler Store log in montly directory and replace : by _ in names of log files. 4 years ago
create-db Switched to python 3.7. 4 years ago
crx-extract Using python 3.7. 4 years ago
crx-jsinventory Moved to Python 3.7. 4 years ago
crx-jsstrings Moved to Python 3.7. 4 years ago
crx-tool Using python 3.7. 4 years ago
extgrep Use ast parser to parse ETag. 4 years ago
requirements.txt Increased requests version (dependency). 5 years ago
setup.py Fixed style errors and warnings. 5 years ago
simhashbucket Switched to python 3.7. 4 years ago

README.md

ExtensionCrawler

A collection of utilities for downloading and analyzing browser extensions from the Chrome Web Store.

  • crawler: A crawler for extensions from the Chrome Web Store.
  • crx-tool: A tool for analyzing and extracting *.crx files (i.e., Chrome extensions). Calling crx-tool <extension>.crx checks the integrity of the extension.
  • crx-extract: A simple tool for extracting *.crx files from the tar-based archive hierarchy.
  • crx-jsinventory: Build a JavaScript inventory of a *.crx file using a JavaScript decomposition analysis.
  • crx-jsstrings: A tool for extracting code blocks, comment blocks, and string literals from JavaScript.
  • create-db: A tool for updating a remote MariaDB from already existing extension archives.
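To give a feel for what checking a *.crx file involves, here is a minimal sketch of locating and testing the ZIP payload inside a CRX container. It is not the implementation used by crx-tool; the function names crx_zip_offset and list_crx_members are hypothetical, and signature verification is deliberately left out. It relies only on the publicly documented container layout: the magic bytes "Cr24", a little-endian version number, and version-specific length fields.

```python
import io
import struct
import zipfile
from typing import List

CRX_MAGIC = b"Cr24"


def crx_zip_offset(data: bytes) -> int:
    """Return the offset at which the ZIP payload of a CRX file starts.

    Hypothetical helper for illustration; not part of ExtensionCrawler.
    """
    if len(data) < 12 or data[:4] != CRX_MAGIC:
        raise ValueError("not a CRX file: missing 'Cr24' magic or too short")
    (version,) = struct.unpack_from("<I", data, 4)
    if version == 2:
        # CRX2: public key length and signature length follow the version.
        if len(data) < 16:
            raise ValueError("truncated CRX2 header")
        pubkey_len, sig_len = struct.unpack_from("<II", data, 8)
        return 16 + pubkey_len + sig_len
    if version == 3:
        # CRX3: one length field for the protobuf-encoded signed header.
        (header_len,) = struct.unpack_from("<I", data, 8)
        return 12 + header_len
    raise ValueError("unsupported CRX version: {}".format(version))


def list_crx_members(data: bytes) -> List[str]:
    """List the files in the embedded ZIP after a CRC check of each member."""
    offset = crx_zip_offset(data)
    with zipfile.ZipFile(io.BytesIO(data[offset:])) as archive:
        corrupt = archive.testzip()  # name of the first bad member, or None
        if corrupt is not None:
            raise ValueError("corrupt archive member: {}".format(corrupt))
        return archive.namelist()
```

Reading a file with open(path, "rb").read() and passing the bytes to list_crx_members would return the member names of an intact extension and raise ValueError for a malformed header or a failed CRC check.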

The utilities store the extensions in the following directory hierarchy:

   archive
   ├── conf
   │   └── forums.conf
   ├── data
   │   └── ...
   └── log
       └── ...

The crawler downloads the most recent version of each extension (i.e., the *.crx file as well as the overview page). In addition, the conf directory may contain one file, called forums.conf, that lists the ids of extensions for which the forums and support pages should be downloaded as well. The data directory will contain the downloaded extensions.
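The layout described above can be sketched in a few lines of Python. This is an illustration only, not code from the crawler: the helper names init_archive and read_forum_ids are hypothetical, and the format of forums.conf (one extension id per line) is an assumption, since it is not specified here. The sketch uses the fact that Chrome extension ids consist of 32 characters from the range a-p.

```python
import re
from pathlib import Path
from typing import List

# Chrome extension ids are 32 characters drawn from the letters a-p.
EXTENSION_ID = re.compile(r"^[a-p]{32}$")


def init_archive(base: Path) -> None:
    """Create the conf, data, and log directories below base (hypothetical helper)."""
    for name in ("conf", "data", "log"):
        (base / name).mkdir(parents=True, exist_ok=True)


def read_forum_ids(base: Path) -> List[str]:
    """Return the extension ids listed in conf/forums.conf.

    Assumes one id per line; lines that do not look like an id are skipped.
    """
    conf = base / "conf" / "forums.conf"
    if not conf.is_file():
        return []  # forums.conf is optional
    ids = []
    for line in conf.read_text().splitlines():
        line = line.strip()
        if EXTENSION_ID.match(line):
            ids.append(line)
    return ids
```

Calling init_archive(Path("archive")) followed by read_forum_ids(Path("archive")) returns an empty list until a forums.conf file has been placed in the conf directory.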

The crawler and create-db scripts will access and update a MariaDB. They will use the host, database, and credentials found in ~/.my.cnf. Since they make use of various JSON features, it is recommended to use at least version 10.2.8 of MariaDB.
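For checking what the scripts would find in ~/.my.cnf, the following sketch reads the connection options with Python's standard configparser module. It is not how the crawler itself loads the file; the function name read_client_options is hypothetical, and the sketch assumes that the options live in a [client] section under the keys host, database, user, and password. It only handles simple option files: !include directives and quoted values are not interpreted.

```python
import configparser
from pathlib import Path
from typing import Dict


def read_client_options(path: Path) -> Dict[str, str]:
    """Return host, database, user, and password from the [client] section.

    Hypothetical helper; keys that are missing or have no value are omitted.
    """
    # allow_no_value accepts bare flags; interpolation=None keeps '%' in passwords.
    parser = configparser.ConfigParser(allow_no_value=True, interpolation=None)
    parser.read(str(path))  # a missing file is silently ignored
    if not parser.has_section("client"):
        return {}
    options = {}
    for key in ("host", "database", "user", "password"):
        if parser.has_option("client", key):
            value = parser.get("client", key)
            if value is not None:
                options[key] = value
    return options
```

A typical call would be read_client_options(Path.home() / ".my.cnf"); an empty result indicates that the file or its [client] section is missing.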

All utilities are written in Python 3.7. The required modules are listed in the file requirements.txt.

Installation

Clone the repository and use pip3 to install it as a package.

git clone git@logicalhacking.com:BrowserSecurity/ExtensionCrawler.git
pip3 install --user -e ExtensionCrawler

Team

Contributors

  • Mehmet Balande

License

This project is licensed under the GPL 3.0 (or any later version).

SPDX-License-Identifier: GPL-3.0-or-later

Master Repository

The master git repository for this project is hosted by the Software Assurance & Security Research Team at https://git.logicalhacking.com/BrowserSecurity/ExtensionCrawler.