CofeehousePy/dltc
Netkas 3294bc4ea6 Added NSFW classification 2021-01-14 02:07:24 -05:00

CoffeeHouse DLTC

CoffeeHouse Deep Learning Classification Engine is a method for creating K2 Models from large datasets and using them to predict labels. For example, you can train a model on a collection of "Sports" and "Political" articles, each assigned the appropriate label. Given a new article that is either "Sports" or "Political" related, the model then predicts the likelihood of the article belonging to each category.

This was forked from magpie but rewritten to handle data and the training process more quickly and efficiently than the original project.

Installation

python3 setup.py install

Usage

Create a directory for your model. The directory must contain a model.json file formatted like this:

{
    "model": {
        "name": "Spam Ham",
        "model_name": "spam_ham",
        "author": "Zi Xing",
        "version": "1.0.0.0",
        "description": "Model for predicting messages which contains spam or ham"
    },
    "training_properties":{
        "epoch": 35,
        "vec_dim": 100,
        "test_ratio": 0.2,
        "architecture": "cnn",
        "batch_size": 64
    },
    "classification": [
        {"l": "spam", "f": "spam.dat"},
        {"l": "ham", "f": "ham.dat"}
    ]
}
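
Before training, it can help to check that a model directory matches the layout above. The following is a minimal sketch using only the standard library; `check_model_directory` and `EXPECTED_KEYS` are hypothetical names, not part of coffeehouse_dltc, and treating every key from the example as required is an assumption.

```python
import json
from pathlib import Path

# Keys taken from the example model.json; whether the library requires
# all of them is an assumption made for this sketch.
EXPECTED_KEYS = {
    "model": {"name", "model_name", "author", "version", "description"},
    "training_properties": {"epoch", "vec_dim", "test_ratio", "architecture", "batch_size"},
}


def check_model_directory(directory):
    """Return a list of problems found; an empty list means none were found."""
    root = Path(directory)
    config_path = root / "model.json"
    if not config_path.is_file():
        return ["model.json is missing"]

    config = json.loads(config_path.read_text(encoding="utf-8"))
    problems = []

    # Report every expected key that is absent from its section.
    for section, keys in EXPECTED_KEYS.items():
        missing = keys - set(config.get(section, {}))
        problems += [f"{section}.{key} is missing" for key in sorted(missing)]

    # Every classification entry must point at a .dat file that exists.
    entries = config.get("classification", [])
    if not entries:
        problems.append("classification list is empty")
    for entry in entries:
        if not (root / entry["f"]).is_file():
            problems.append(f"data file {entry['f']} for label {entry['l']} not found")
    return problems
```

Calling `check_model_directory('<Model Directory>')` and printing the result before training gives a quick list of anything that looks incomplete.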

Model

| Property Name | Description |
|---------------|-------------|
| name | The name of the model |
| model_name | The safe name of the model, which is used for IO operations |
| author | The author who constructed the data for the model |
| version | The version of the model |
| description | The description of the model, what it does, etc. |

Training Properties

| Property Name | Description |
|---------------|-------------|
| epoch | The number of training passes (epochs) the model runs through |
| vec_dim | The dimensionality of the word vectors |
| test_ratio | The fraction of the data held out for testing. The data is split into train and test datasets, and the model evaluates itself after every epoch, displaying its current loss and accuracy. The default value is 0, meaning that all the data is used for training. |
| architecture | The type of model to train; the possible values are cnn and rnn |
| batch_size | The batch size used during training |
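
To make the effect of test_ratio concrete, here is a minimal sketch of a ratio-based split. It is an illustration only: `split_by_ratio` is a hypothetical helper, and shuffling before the split is an assumption, since the way the library selects its test samples is not described here.

```python
import random


def split_by_ratio(samples, test_ratio=0.0, seed=0):
    """Split samples into (train, test); test_ratio is the held-out fraction."""
    if not 0.0 <= test_ratio < 1.0:
        raise ValueError("test_ratio must be in the range [0, 1)")
    shuffled = list(samples)
    # Shuffling is an assumption of this sketch, not documented library behaviour.
    random.Random(seed).shuffle(shuffled)
    test_size = round(len(shuffled) * test_ratio)
    # With test_ratio == 0 the test set is empty and all data is used for training.
    return shuffled[test_size:], shuffled[:test_size]


train, test = split_by_ratio(range(1000), test_ratio=0.2)
print(len(train), len(test))  # 800 200
```

With the example value of 0.2, one fifth of the samples are held out for evaluation and the rest are used for training.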

Classification

| Property Name | Description |
|---------------|-------------|
| l | The label for the data, e.g. spam, ham |
| f | The name of the .dat file containing the data for that label, one sample per line |
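
Putting the three sections together, a model directory can be generated from in-memory samples. This is a sketch under stated assumptions: `write_model_directory` and `DEFAULT_TRAINING` are hypothetical names, the training values are copied from the example above, and each .dat file is assumed to hold one sample per line in UTF-8.

```python
import json
from pathlib import Path

# Values copied from the example model.json; adjust them for your own model.
DEFAULT_TRAINING = {
    "epoch": 35,
    "vec_dim": 100,
    "test_ratio": 0.2,
    "architecture": "cnn",
    "batch_size": 64,
}


def write_model_directory(directory, model_info, samples_by_label, training=None):
    """Write one <label>.dat file per label plus a matching model.json."""
    root = Path(directory)
    root.mkdir(parents=True, exist_ok=True)

    classification = []
    for label, samples in samples_by_label.items():
        dat_name = f"{label}.dat"
        # One sample per line: collapse any whitespace (including line
        # breaks) inside a sample so it cannot span several lines.
        lines = [" ".join(sample.split()) for sample in samples]
        (root / dat_name).write_text("\n".join(lines) + "\n", encoding="utf-8")
        classification.append({"l": label, "f": dat_name})

    config = {
        "model": model_info,
        "training_properties": dict(training or DEFAULT_TRAINING),
        "classification": classification,
    }
    config_path = root / "model.json"
    config_path.write_text(json.dumps(config, indent=4), encoding="utf-8")
    return config_path
```

The `model_info` argument is expected to carry the five Model properties (name, model_name, author, version, description) as a plain dictionary.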

Training the model

Before training, the model data is clustered into a structured temporary directory. This step generates a set of data and label files that are easier to manage and train from. Once training has finished, the temporary directory is deleted.
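
The general shape of such a step (expand the data into per-sample files in a temporary directory, train from them, then clean up) can be sketched as follows. This is not the library's implementation: `cluster_samples` and `with_clustered_data` are hypothetical helpers, and the numbered .txt/.lab file naming is an assumption chosen for illustration.

```python
import tempfile
from pathlib import Path


def cluster_samples(model_dir, classification, work_dir):
    """Write one numbered .txt/.lab pair per sample line; return the pair count."""
    count = 0
    for entry in classification:
        dat_path = Path(model_dir) / entry["f"]
        with dat_path.open(encoding="utf-8") as handle:
            for line in handle:
                text = line.strip()
                if not text:
                    continue  # skip blank lines
                # The .txt/.lab naming is an assumption of this sketch.
                (Path(work_dir) / f"{count}.txt").write_text(text, encoding="utf-8")
                (Path(work_dir) / f"{count}.lab").write_text(entry["l"], encoding="utf-8")
                count += 1
    return count


def with_clustered_data(model_dir, classification, train):
    """Cluster into a temporary directory, call train(work_dir), then clean up."""
    with tempfile.TemporaryDirectory() as work_dir:
        cluster_samples(model_dir, classification, work_dir)
        # The temporary directory is removed as soon as this block exits.
        return train(work_dir)
```

Here `classification` is the list from model.json and `train` is any callable that accepts the temporary directory path.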

from coffeehouse_dltc.chmodel.configuration import Configuration

# Model directory must contain model.json and the required .dat files
configuration = Configuration('<Model Directory>')
configuration.train_model()

Once this process is done, an output directory is created containing all the generated model files:

| File Extension | Description |
|----------------|-------------|
| .che | Contains the word vectors |
| .chs | Contains the scaler data |
| .chm | The main classification model |
| .chl | JSON file containing the labels |

All of these files are required for the model data to be loaded correctly into memory.
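
Since every one of these files is needed, a quick check of the output directory before loading can save a confusing failure later. A minimal sketch, where `missing_model_files` is a hypothetical helper and only the four extensions listed above are assumed:

```python
from pathlib import Path

# The four extensions listed in the table above.
REQUIRED_EXTENSIONS = (".che", ".chs", ".chm", ".chl")


def missing_model_files(output_dir):
    """Return the required extensions that have no file in output_dir."""
    present = {path.suffix for path in Path(output_dir).iterdir() if path.is_file()}
    return [ext for ext in REQUIRED_EXTENSIONS if ext not in present]
```

An empty list means a file with each extension is present; it does not verify that the contents are valid.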

Classifying data

Assuming the model files have been created, you can load the model cluster and predict from text or file input:

from coffeehouse_dltc.main import DLTC

dltc = DLTC()
dltc.load_model_cluster('<Model Directory Output>')

dltc.predict_from_text("Hello World")
# [('ham', 0.9650128), ('spam', 0.040875915)]


dltc.predict_from_file("text.txt")
# [('spam', 0.61647576), ('ham', 0.42338383)]
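
The results shown above are lists of (label, score) pairs, so picking a single answer is a small post-processing step. A minimal sketch, assuming that result shape; `top_label` and its `threshold` parameter are hypothetical and not part of the library:

```python
def top_label(predictions, threshold=0.5):
    """Return the highest-scoring (label, score) pair, or None if it is below threshold."""
    if not predictions:
        return None
    label, score = max(predictions, key=lambda pair: pair[1])
    return (label, score) if score >= threshold else None


print(top_label([('ham', 0.9650128), ('spam', 0.040875915)]))
# ('ham', 0.9650128)
```

Note that the scores in the examples above do not add up to exactly 1, so it is safer to treat them as per-label scores than as a probability distribution.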

From the CLI

You can access CoffeeHouse-DLTC's features from the command-line interface.

python3 -m coffeehouse_dltc --model-info <source directory>
python3 -m coffeehouse_dltc --train-model <source directory>
python3 -m coffeehouse_dltc --test-model <built model directory>
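
If you want to drive these commands from a script, they can be wrapped with the standard library. The sketch below uses only the three flags listed above; `dltc_command` and `run_dltc` are hypothetical helpers, and using `sys.executable` in place of `python3` is a choice made here so the same interpreter is reused.

```python
import subprocess
import sys

# Short action names mapped to the CLI flags listed above.
ACTIONS = {
    "info": "--model-info",
    "train": "--train-model",
    "test": "--test-model",
}


def dltc_command(action, directory):
    """Build the argument list for one coffeehouse_dltc CLI call."""
    return [sys.executable, "-m", "coffeehouse_dltc", ACTIONS[action], str(directory)]


def run_dltc(action, directory):
    """Run the command and raise CalledProcessError on a non-zero exit code."""
    return subprocess.run(dltc_command(action, directory), check=True)
```

For example, `run_dltc("train", "<source directory>")` followed by `run_dltc("test", "<built model directory>")` trains a model and then tests the result.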