ExtensionCrawler/README.md

42 lines
1.6 KiB
Markdown
Raw Blame History

This file contains invisible Unicode characters

This file contains invisible Unicode characters that are indistinguishable to humans but may be processed differently by a computer. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# ExtensionCrawler
A collection of utilities for downloading and analyzing browser
extension from the Chrome Web store.
* `crawler`: A crawler for extensions from the Chrome Web Store.
* `crx-tool`: A tool for analyzing and extracting `*.crx` files
(i.e., Chrome extensions). Calling `crx-tool.py <extension>.crx`
will check the integrity of the extension.
* `extract-crx`: A simple tool for extracint `*.crx` files from the
tar-based archive hierarchy.
* `create-db`: A tool for creating/initializing the database files
from already existing extension archives.
The utilities store the extensions in the following directory
hierarchy:
```
archive
   ├── conf
   │   └── forums.conf
   ├── data
   │   └── ...
   └── log
   └── ...
```
The crawler downloads the most recent extension (i.e., the `*.crx`
file as well as the overview page. In addition, the `conf` directory
may contain one file, called `forums.conf` that lists the ids of
extensions for which the forums and support pages should be downloaded
as well. The `data` directory will contain the downloaded extensions
as well as sqlite files containing the extracted meta data. The sqlite
files can easily be re-generated using the `create-db` tool.
All utilities are written in Python 3.x. The required modules are listed
in the file `requirements.txt`.
## Team
* [Achim D. Brucker](http://www.brucker.ch/)
* [Michael Herzberg](http://www.dcs.shef.ac.uk/cgi-bin/makeperson?M.Herzberg)
## License
This project is licensed under the GPL 3.0 (or any later version).