## Introduction

The extension crawler downloads all metadata and extension files into tar files. This is great for archival, but not so great for analyzing the data. The crawler therefore also supports inserting all newly crawled information into a MariaDB database. Additionally, there exists a script to regenerate the database from old tar files.

## Setting up the database

### Hardware requirements

The database is meant to be set up on an (old) PC, although it should also work with common cloud offerings.

The amount of data that the database needs to handle grows over time. Currently, with ~18 months' worth of data, the database requires ~150GB of space.

It is recommended to have at least 16GB of RAM to keep the indices in memory; less RAM might work, and more RAM will certainly speed up queries. It is also good to have at least 16GB of swap; while swapping is detrimental to MariaDB's performance, it is often better than having the server killed by the OS.
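
If the machine does not already have swap configured, a swap file is a quick way to add it. A minimal sketch (the path `/swapfile` and the 16GB size are illustrative):

```bash
# Create and enable a 16 GB swap file; path and size are illustrative.
fallocate -l 16G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
# To make it persistent across reboots, add to /etc/fstab:
# /swapfile none swap sw 0 0
```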

For storage, it is beneficial to have at least one HDD and one SSD, as the database workload can be split into sequential and random IO.

### Configuration

A commented configuration file for MariaDB can be found in `config/my.cnf`. Configuration options such as the buffer pool size and the storage locations will need to be adjusted to your setup.
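
As a rough illustration of the options in question, a sketch follows; the sizes and paths are placeholders, and the authoritative values belong in the commented `config/my.cnf`:

```ini
[mysqld]
# Keep hot data and indices in memory (adjust to the available RAM).
innodb_buffer_pool_size   = 8G
# Table data (mostly random IO) goes to the SSD ...
datadir                   = /mnt/ssd/mysql
# ... while redo and binary logs (mostly sequential IO) go to the HDD.
innodb_log_group_home_dir = /mnt/hdd/mysql-logs
log_bin                   = /mnt/hdd/mysql-logs/mysql-bin
```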

### Table schemas

To set up the tables and schemas, make sure that you have the credentials for `root` in your `~/.my.cnf` file, and execute the following:

```bash
mysql -e "create database extensions;"
for f in schemas/*.sql; do mysql extensions < "$f"; done
for f in views/*.sql; do mysql extensions < "$f"; done
```
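
Afterwards, you can check that the tables and views were created:

```bash
mysql extensions -e "show tables;"
```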

## Maintaining the database

### Memory consumption

MariaDB will, at times, use much more memory than the configured buffer pool size: 100GB of resident memory with a pool size of 4GB is certainly possible while regenerating the data. In these cases, the database should be restarted. The crawler and the regeneration script will, by default, retry their database operations for around one hour, so they tolerate such a restart.
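
A minimal sketch of such a restart, assuming MariaDB runs as a systemd service (the 64GB threshold is illustrative):

```bash
# Restart MariaDB when the server process uses too much resident memory.
# The process is called mariadbd on recent versions, mysqld on older ones;
# the 64 GB threshold (in kB) is illustrative.
rss=$(ps -C mariadbd -o rss= | head -n 1)
if [ "${rss:-0}" -gt 67108864 ]; then
    systemctl restart mariadb
fi
```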

### Backup

Regenerating the whole data set can take days, if not weeks, so even though all data can be restored, having a backup speeds up recovery. For this purpose, the MariaDB binary log is enabled to allow physical backups, which are much faster than logical backups in our case. The `scripts/` folder contains scripts for full and incremental backups, as well as scripts to back up the schemas and users (including permissions and hashed passwords).
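
Conceptually, a physical full and incremental backup with `mariabackup` looks roughly like the sketch below; the paths and user are illustrative, and the actual invocations in `scripts/` may differ:

```bash
# Full physical backup (paths and user are illustrative).
mariabackup --backup --user=root --target-dir=/backup/full

# Incremental backup on top of the full backup; mariabackup uses the
# InnoDB log sequence number recorded in /backup/full to copy only changes.
mariabackup --backup --user=root --target-dir=/backup/inc1 \
    --incremental-basedir=/backup/full
```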

### Regenerating extension data

When the crawler is changed to extract more or different data from the extensions, one will probably want to regenerate all data, i.e., ask the crawler to go through all existing tar files and re-extract the already downloaded data. To do so, the `create-db` script or `sge/create-db.sh` (for HPC clusters) can be used. More information can be found by calling these scripts with `--help`.
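
For example:

```bash
# List all supported options; the exact flags depend on the script version.
./create-db --help
sge/create-db.sh --help
```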

## Using the data set

### Example queries

For more (commented) queries, see the `queries/` folder. As a first example:

```sql
select extid, crx_etag, count(filename) from extension_most_recent_small
join crxfile using (crx_etag) where filename like '%.js'
group by extid, crx_etag limit 10;
```

This query prints the number of JavaScript files per extension, for ten extensions from the small subset.
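
As another sketch in the same spirit, one can combine the `extension_most_recent_small` and `extension_second_most_recent` views (described below) to list extensions whose `crx_etag` changed between the two most recent crawls, i.e., extensions that were just updated:

```sql
-- Extensions whose crx_etag changed between the two most recent crawls
-- (small subset, first ten results).
select extid, s.crx_etag as old_etag, m.crx_etag as new_etag
from extension_most_recent_small m
join extension_second_most_recent s using (extid)
where m.crx_etag <> s.crx_etag
limit 10;
```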

### Table schemas

All schema files can be found in the `schemas/` folder.

| Table name | Description |
| --- | --- |
| extension | General extension metadata from the store pages. One row per extension and crawl date (!). If you are only interested in the most recent *view* of the Chrome Web Store, use the `extension_most_recent` view. For testing your queries, suffix either table/view with *\_small* to only get roughly 1/256th of all extensions. |
| status | The HTTP status codes for the store page and `crx` download. |
| crx | General metadata of the extension file (the `crx` archive itself). Also contains the manifest. |
| crxfile | General metadata of the extension files, i.e., the files contained in the `crx` archives (JavaScript files, etc.). |
| category | Categories of the extensions, e.g., *productivity*, *office*, or *game*. |
| permission | Permissions found in the manifests, e.g., *webRequest* or *tab*, but also host permissions such as *https://www.google.com* (see the example query after this table). |
| content_script_url | Content script URLs found in the manifest. These are the URLs where the extensions request to have a content script executed when the user visits the website. |
| libdet | Information about used libraries. For each file found in `crx` archives (identified by MD5 sums), this table stores classifications of the file, e.g., whether it is a certain library. |
| review{,\_comment} | Post metadata and posts from the review forum of an extension. |
| support{,\_comment} | Post metadata and posts from the support forum of an extension. |
| reply{,\_comment} | Reply-post metadata and posts for both the review and support forums. |
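
For instance, to get an idea of the most requested permissions, one can join the `permission` table with the most recent extensions. This is a sketch that assumes `permission` is keyed by `crx_etag` and has a `permission` column; see the files in `schemas/` for the authoritative definitions:

```sql
-- The ten most frequently requested permissions among the most recent
-- extension versions (small subset; column names assumed, see schemas/).
select permission, count(*) as extensions
from extension_most_recent_small
join permission using (crx_etag)
group by permission
order by extensions desc
limit 10;
```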

### Views

All views can be found in the `views/` folder.

| View name | Description |
| --- | --- |
| extension_small | Contains only roughly 1/256th of all extensions. |
| extension_most_recent | Instead of one row for every combination of extension id and crawl date, this view only contains the rows from the most recent crawl date. |
| extension_most_recent_small | Same, but with only roughly 1/256th of all extensions. |
| extension_second_most_recent | Similar to `extension_most_recent`, but contains the second-most recent entry for all extensions. This is useful for investigating how extensions change. |
| extension_{most,second_most}_recent_until_date | Parameterized query. Only considers extensions crawled before a given date. Usage: `select * from (select @until_date:='2018-05-25') foo, extension_most_recent_until_date;` |
| extension_update | Selects all extension updates in the database. A row in the result represents one extension update, with the date and `crx_etag` when we first saw the update, and the date and `crx_etag` when we last saw the old version. As we crawl every night, the difference should be around 24 hours on average (see the example query after this table). |
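
As a closing usage sketch for `extension_update`, the query below counts the observed updates per extension; the column name `extid` is an assumption, so check the view definition in `views/`:

```sql
-- Extensions with the most observed updates (column name extid assumed;
-- see the view definition in views/).
select extid, count(*) as updates
from extension_update
group by extid
order by updates desc
limit 10;
```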