# CoffeeHouse DLTC

CoffeeHouse Deep Learning Classification Engine is a method for creating K2 Models from large
datasets and using them to predict labels. For example, you can train a model on a collection of
"Sports" articles and "Political" articles, each assigned the appropriate label. You can then give
the model a new article that is either sports or politics related, and it will predict the
likelihood of the article belonging to each label.

This was forked from [magpie](https://github.com/inspirehep/magpie) but rewritten to handle data
and the training process more quickly and efficiently than the original project. 

# Installation

```shell script
python3 setup.py install
```

# Usage

Create a directory for your model. The directory must contain a `model.json` file
formatted like this:

```json
{
    "model": {
        "name": "Spam Ham",
        "model_name": "spam_ham",
        "author": "Zi Xing",
        "version": "1.0.0.0",
        "description": "Model for predicting messages which contains spam or ham"
    },
    "training_properties":{
        "epoch": 35,
        "vec_dim": 100,
        "test_ratio": 0.2,
        "architecture": "cnn",
        "batch_size": 64
    },
    "classification": [
        {"l": "spam", "f": "spam.dat"},
        {"l": "ham", "f": "ham.dat"}
    ]
}
```

### Model

| Property Name | Description                                                |
|---------------|------------------------------------------------------------|
| name          | The name of the model                                      |
| model_name    | The safe name of the model which is used for IO operations |
| author        | The author who constructed the data for the model          |
| version       | The version of the model                                   |
| description   | The description of the model, what it does, etc.           |


### Training Properties

| Property Name | Description                                                                                                                                                                                                           |
|---------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| epoch         | The number of full training passes (epochs) the model runs through                                                                                                                                                    |
| vec_dim       | The dimensionality (number of components) of each word vector                                                                                                                                                         |
| test_ratio    | Fraction of the data held out for testing; the model evaluates itself on it after every epoch, displaying its current loss and accuracy. The default is 0, meaning that all the data is used for training.            |
| architecture  | The type of model to train on, the possible values are `cnn` and `rnn`                                                                                                                                                |
| batch_size    | The size of the batch for training purposes                                                                                                                                                                           |

### Classification

| Property Name | Description                                                                 |
|---------------|-----------------------------------------------------------------------------|
| l             | The label for the data, e.g. `spam` or `ham`                                |
| f             | The name of the `.dat` file containing the data, one entry per line         |
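
Before training, it can help to check that a model directory matches the layout described in the tables above. The following is a minimal sketch using only the standard library. The helper `check_model_directory` and the `REQUIRED_KEYS` mapping are hypothetical names introduced for illustration, and the set of required keys is inferred from the tables, so the library's own validation may be stricter or more lenient.

```python
import json
from pathlib import Path

# Keys inferred from the tables above (an assumption; the library's own
# checks may differ).
REQUIRED_KEYS = {
    "model": {"name", "model_name", "author", "version", "description"},
    "training_properties": {
        "epoch", "vec_dim", "test_ratio", "architecture", "batch_size",
    },
}


def check_model_directory(model_dir):
    """Return a list of problems found in the model directory (empty if none)."""
    model_dir = Path(model_dir)
    config_path = model_dir / "model.json"
    if not config_path.is_file():
        return ["model.json is missing"]

    with config_path.open(encoding="utf-8") as handle:
        config = json.load(handle)

    problems = []
    # Report every documented property that is absent from its section.
    for section, keys in REQUIRED_KEYS.items():
        missing = keys - set(config.get(section, {}))
        for key in sorted(missing):
            problems.append(f"{section}.{key} is missing")

    # Only `cnn` and `rnn` are documented as valid architectures.
    architecture = config.get("training_properties", {}).get("architecture")
    if architecture is not None and architecture not in ("cnn", "rnn"):
        problems.append(f"unknown architecture: {architecture!r}")

    # Each classification entry needs a label and an existing data file.
    entries = config.get("classification", [])
    if not entries:
        problems.append("classification list is empty")
    for entry in entries:
        data_file = entry.get("f")
        if "l" not in entry or data_file is None:
            problems.append(f"classification entry needs 'l' and 'f': {entry!r}")
        elif not (model_dir / data_file).is_file():
            problems.append(f"data file not found: {data_file}")
    return problems
```

Calling `check_model_directory('<Model Directory>')` returns an empty list when nothing is missing, so the result can be printed or used to abort early.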


## Training the model

To train the model, the data is first clustered into a structured temporary directory. This
creates a set of files for the data and labels that are easier to manage and train from.
The temporary directory is deleted once training has finished.

```python
from coffeehouse_dltc.chmodel.configuration import Configuration

# Model directory must contain model.json and the required .dat files
configuration = Configuration('<Model Directory>')
configuration.train_model()
```

Once this process is done, an output directory is created containing all the generated model files:

| File Extension | Description                                  |
|----------------|----------------------------------------------|
| `.che`         | This file contains the word vectors          |
| `.chs`         | File format responsible for the scaler data  |
| `.chm`         | Main classification model                    |
| `.chl`         | JSON File format which contains the labels   |

All of these files are required for the model data to be loaded correctly into memory.
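
Since all four file types are needed, a quick check of the output directory can catch an incomplete build before loading it. This is a minimal sketch, assuming the generated files sit directly in the output directory and can be identified by extension alone; the helper name `missing_model_files` is hypothetical and not part of the library.

```python
from pathlib import Path

# Extensions taken from the table above.
EXPECTED_EXTENSIONS = (".che", ".chs", ".chm", ".chl")


def missing_model_files(output_dir):
    """Return the expected extensions that have no matching file in output_dir."""
    found = {path.suffix for path in Path(output_dir).iterdir() if path.is_file()}
    return [ext for ext in EXPECTED_EXTENSIONS if ext not in found]
```

An empty list means that at least one file of each type is present; it does not verify that the files themselves are valid.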


## Classifying data

Assuming the model files have been created, you can load the model cluster and
predict from text or file input:

```python
from coffeehouse_dltc.main import DLTC

dltc = DLTC()
dltc.load_model_cluster('<Model Directory Output>')

dltc.predict_from_text("Hello World")
# [('ham', 0.9650128), ('spam', 0.040875915)]


dltc.predict_from_file("text.txt")
# [('spam', 0.61647576), ('ham', 0.42338383)]
```
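
The example outputs above are lists of `(label, score)` tuples, and the scores do not necessarily sum to 1. To turn such a list into a single decision, one option is to take the highest-scoring label and reject it when the score is too low. The sketch below assumes that result format; `top_label` and its `threshold` parameter are hypothetical names chosen for illustration, not part of the library.

```python
def top_label(predictions, threshold=0.5):
    """Return the highest-scoring label, or None if its score is below threshold."""
    if not predictions:
        return None
    # Do not rely on the list being sorted; pick the maximum score explicitly.
    label, score = max(predictions, key=lambda pair: pair[1])
    return label if score >= threshold else None


results = [("ham", 0.9650128), ("spam", 0.040875915)]
print(top_label(results))  # ham
```

The threshold of 0.5 is an arbitrary default; a suitable value depends on the model and on how costly a wrong label is.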


## From the CLI

You can access CoffeeHouse-DLTC's features from the command-line interface.

```shell script
python3 -m coffeehouse_dltc --model-info <source directory>
python3 -m coffeehouse_dltc --train-model <source directory>
python3 -m coffeehouse_dltc --test-model <built model directory>
```
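
The same commands can be driven from a Python script, for example in a build step. This is a small sketch that uses only the three flags listed above. It assumes the package is installed for the interpreter running the script and that the CLI exits with a non-zero status on failure, which the source does not confirm; `build_command` and `run_cli` are hypothetical helper names.

```python
import subprocess
import sys

# Flags taken from the commands listed above.
SUPPORTED_FLAGS = ("--model-info", "--train-model", "--test-model")


def build_command(flag, directory):
    """Build the argument list for one of the CLI flags shown above."""
    if flag not in SUPPORTED_FLAGS:
        raise ValueError(f"unsupported flag: {flag}")
    # sys.executable runs the module with the current interpreter.
    return [sys.executable, "-m", "coffeehouse_dltc", flag, str(directory)]


def run_cli(flag, directory):
    """Run the command; raises CalledProcessError on a non-zero exit status."""
    return subprocess.run(build_command(flag, directory), check=True)
```

Passing the arguments as a list avoids shell quoting problems when a directory name contains spaces.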

Added DLTC 2020-12-25 20:16:54 +01:00