Welcome!

This is the official repository for StarChat, an open-source, scalable conversational engine for B2B applications. You can find the code in our GitHub repository.

How to contribute

To contribute to StarChat, please send us a pull request from your fork of this repository.

Our concise contribution guideline contains the bare minimum requirements for code contributions.

Before contributing (or opening issues), you might want to send us an email at starchat@getjenny.com.

Quick Start

Requirements

The easiest way is to install StarChat using two docker images. You only need:

In this way, you will put all the indices in the Elasticsearch (version 6.1) image, and StarChat itself in the Java (8) image.

If you do not use Docker, you therefore need on your machine:

  1. Scala 2.12
  2. Elasticsearch 6.1 (the same version used by the Docker image above)

1. Launch Docker containers

We provide all the containers needed to get StarChat up and running without compiling locally.

To use them, you need to download Starchat Docker or simply type:

git clone git@github.com:GetJenny/starchat-docker.git

Now just type:

docker-compose up -d

And you will have a running instance on port 8888. (If you want to change ports, go to starchat-docker/starchat/config/application.conf)

(Problems like elastisearch exited with code 78? have a look at troubleshooting!)

Change Manaus Language

In the file docker-compose.yml change:

    command: ["/manaus/scripts/wait-for-it.sh", 
    "getjenny-es", "9200", "5", 
    "/manaus/bin/continuous-keywords-update", "--temp_data_folder", 
    "/manaus/data", 
    "--host_map", "getjenny-es=9300", "--interval_sec", "14400", 
    "--word_frequencies", 
    "/manaus/statistics_data/english/word_frequency.tsv", 
    "--cluster_name", "starchat", "--index_name", "jenny-en-0"]

to

    command: ["/manaus/scripts/wait-for-it.sh", 
    "getjenny-es", "9200", "5", 
    "/manaus/bin/continuous-keywords-update", "--temp_data_folder", 
    "/manaus/data", 
    "--host_map", "getjenny-es=9300", "--interval_sec", "14400", 
    "--word_frequencies", "/manaus/statistics_data/***italian***/word_frequency.tsv", 
    "--cluster_name", "starchat", "--index_name",  "***jenny-it-0***"]

2. Create Elasticsearch indices

Run from a terminal:

# create the system indices in Elasticsearch
PORT=${1:-8888}
curl -v -H "Authorization: Basic `echo -n 'admin:adminp4ssw0rd' | base64`" \
  -H "Content-Type: application/json" -X POST "http://localhost:${PORT}/system_index_management/create"

If you are using a language other than English, replace english in the name of the index:

# create the application indices on Elasticsearch
PORT=${1:-8888}
INDEX_NAME=${2:-index_getjenny_english_0}
curl -v -H "Authorization: Basic `echo -n 'admin:adminp4ssw0rd' | base64`" \
  -H "Content-Type: application/json" -X POST "http://localhost:${PORT}/${INDEX_NAME}/index_management/create"
# add a user associated with the application index index_getjenny_english_0 created above
PORT=${1:-8888}
curl -v -H "Authorization: Basic `echo -n 'admin:adminp4ssw0rd' | base64`" \
  -H "Content-Type: application/json" -X POST http://localhost:${PORT}/user -d '{
        "id": "test_user",
        "password": "3c98bf19cb962ac4cd0227142b3495ab1be46534061919f792254b80c0f3e566f7819cae73bdc616af0ff555f7460ac96d88d56338d659ebd93e2be858ce1cf9", 
        "salt": "salt",
        "permissions": {
                "index_getjenny_english_0": ["read", "write"]
        }
}'
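The password field above is a salted SHA-512 hash (see the Security section below). A sketch of generating such a hash from the command line, assuming the scheme is SHA-512 over the password concatenated with the salt (verify against your StarChat version before relying on it):

```shell
# Hypothetical: derive a hashed password for user creation.
# Assumes hash = SHA-512(password + salt); check your StarChat version.
PASSWORD='p4ssw0rd'
SALT='salt'
hash=$(echo -n "${PASSWORD}${SALT}" | sha512sum | awk '{print $1}')
echo "$hash"
```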

3. Load the configuration file

Now you have to load the configuration file for the actual chat, also known as the decision table. We provide an example configuration file in English, therefore:

cd $STARCHAT  # so we have doc/decision_table_starchat_doc.csv
PORT=${1:-8888}
INDEX_NAME=${2:-'index_getjenny_english_0'}
FILENAME=${3:-"`readlink -e doc/decision_table_starchat_doc.csv`"}
curl -v -H "Authorization: Basic `echo -n 'test_user:p4ssw0rd' | base64`" \
  --form "csv=@${FILENAME}" "http://localhost:${PORT}/${INDEX_NAME}/decisiontable_upload_csv"

and then you need to index the analyzer:

PORT=${1:-8888}
INDEX_NAME=${2:-index_getjenny_english_0}
curl -v -H "Authorization: Basic `echo -n 'test_user:p4ssw0rd' | base64`" \
  -H "Content-Type: application/json" -X POST "http://localhost:${PORT}/${INDEX_NAME}/decisiontable_analyzer"

If you need to start over, this endpoint deletes all previously loaded states:

PORT=${1:-8888}
INDEX_NAME=${2:-index_getjenny_english_0}
curl -v -H "Authorization: Basic `echo -n 'test_user:p4ssw0rd' | base64`" \
  -H "Content-Type: application/json" -X DELETE http://localhost:${PORT}/${INDEX_NAME}/decisiontable

4. Load external corpus (optional)

To obtain good word statistics, and consequently better matching, you might want to index a corpus which is hidden from results. For instance, you can index various sentences as hidden using the POST /knowledgebase endpoint with doctype: "hidden".

5. Index the FAQs (optional)

TODO: You might want to activate the knowledge base for simple Question and Answer.

Install without Docker

Note: we do not support this installation.

The service binds to port 8888 by default.

Install local Docker (for testing branches)

Generate a package distribution. In the StarChat directory:

sbt dist

Enter the directory docker-starchat:

cd  docker-starchat

You will get a message like Your package is ready in ...../target/universal/starchat-4ee.... .zip. Extract the package into the docker-starchat folder:

unzip ../target/universal/starchat-4eee.....zip
ln -s starchat-4ee..../  starchat

The zip package contains:

Review the configuration file starchat/config/application.conf and configure the language if needed (by default you have index_language = "english").

(If you are re-installing StarChat and want to start from scratch, see start from scratch.)

Start both StarChat and Elasticsearch:

docker-compose up -d

(Problems like elastisearch exited with code 78? have a look at troubleshooting!)

Test the installation

Is the service working? First of all: did you load a configuration file? If yes, try:

curl -X GET localhost:8888 | python -mjson.tool

Get the test_state

PORT=${1:-8888}
INDEX_NAME=${2:-index_getjenny_english_0}
curl -v -H "Authorization: Basic `echo -n 'test_user:p4ssw0rd' | base64`" \
 -H "Content-Type: application/json" -X POST http://localhost:${PORT}/${INDEX_NAME}/get_next_response -d '{
 "conversation_id": "1234",
  "user_input": { "text": "install starchat" },
  "values": {
      "return_value": "",
      "data": {}
       }
  }'

You should get:

[
   {
      "state" : "install",
      "data" : {},
      "action" : "",
      "success_value" : "",
      "state_data" : {},
      "score" : 1,
      "conversation_id" : "1234",
      "traversed_states" : [
         "install"
      ],
      "max_state_count" : 0,
      "action_input" : {},
      "analyzer" : "band(bor(keyword(\"setup\"), keyword(\"install.*\")), bnot(bor(keyword(\"standalone\"), keyword(\"docker\"))))",
      "bubble" : "Just choose one of the two:\n<ul>\n<li>docker install (recommended)</li>\n<li>standalone install</li>\n</ul>",
      "failure_value" : ""
   }
]

If you look at the "analyzer" field, you'll see that this state is triggered when the user types "setup" or a word starting with "install", as long as neither "standalone" nor "docker" appears. Try with "text": "Please don't send me the test state" and StarChat will send an empty message.

Configuration of the chatbot (Decision Table)

With StarChat's Decision Table you can easily implement workflow-based chatbots. After the installation (see above) you only have to configure a conversation flow and, if needed, a front-end client.

NLP processing

NLP processing is of course the core of any chatbot. As you may have noted in the CSV provided in the doc/ directory, there are two fields that define when StarChat should trigger a state: analyzer and queries.

Queries

If the analyzer field is empty, StarChat queries Elasticsearch for the state containing the sentence most similar to the user input in the queries field. We have carefully configured Elasticsearch to provide good answers (e.g. boosting results where the same words appear, etc.), and the results are acceptable. But you are encouraged to use the analyzer field, documented below.

Analyzer

The analyzers are written in a Domain Specific Language that allows you to combine various functions using logical operators.

Through the analyzers, you can leverage various NLP algorithms included in StarChat and combine their results with other rules. For instance, you might want to enter the state that asks for the email address only if the variable "email" is not set. Or you might want to escalate to a human operator after detecting swearing three times, or escalate only on working days. You can do all of that with the analyzers.

Let's look at the simple example included in the CSV provided in the doc/ directory for the state forgot_password:

booleanAnd(keyword("password"),booleanOr(keyword("reset"),keyword("forgot")))

The analyzer above says the state must be triggered if the term "password" is detected together with either "reset" or "forgot".
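As an illustration only, the same boolean combination can be mimicked in plain shell (this is a toy sketch, not StarChat code):

```shell
# Toy re-creation of booleanAnd(keyword("password"),
# booleanOr(keyword("reset"), keyword("forgot"))) for a single query.
query="I forgot my password"
has() { case " $query " in *" $1 "*) return 0 ;; *) return 1 ;; esac; }
if has password && { has reset || has forgot; }; then
  echo "state triggered"
else
  echo "no match"
fi
```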

Another example. Imagine we have a state called send-updates. In this state StarChat proposes the question "Where can we send you updates?". In the config file:

state           | ... | bubble                              | ...
send-updates    | ... | "Where can we send you updates?"    |

In another state, called send-email, the analyzer field contains:

booleanAnd(lastTravStateIs("send-updates"), matchEmailAddress("verification_"))

This means send-email will be triggered only after send-updates (because of lastTravStateIs) and only if an email address is detected (because of matchEmailAddress). This also sets the variable verification_email, because on success matchEmailAddress always sets such a variable, using the expression's argument as prefix.

In addition, an expression like sendVerificationEmail could be developed (we haven't) which accepts more arguments, for example:

sendVerificationEmail("verification_", "Email about Verification", "Here is your verification link %__temp__verification_link%")

In this case, the expression would send the email with the given subject and body, setting the relevant variables on success.

TODO: It is fundamental here to build a set of metadata which allows any other component to receive all the needed information about an analyzer. For instance, sendVerificationEmail could have something like:

{
    "documentation": "Send an email with subject and body. If successful returns 1.0 and sets the variables. If not returns 0 and does not set the variables.",
    "argument_list": ["'prefix' of the variable 'prefix_email'",
                      "Subject of the email to be sent", "Body of the mail"],
    "created_variables": {  // Variables it creates
        "__temp__verification_link": "Link provided by the brand's API",  // deleted after usage because it starts with __temp__
        "prefix_email": "email address"
    },
    "used_variables": {}  // Variables it expects to find in the state (none here)
}

How the analyzers are organized

The analyzer DSL's building blocks are the expressions. For instance, or, and, and keyword are all expressions.

Expressions are then divided into operators (or, ...) and atoms (keyword).

Expressions: Atoms

Presently, keyword("reset") in the example provides a very simple score: the number of occurrences of the word reset in the user's query divided by the total number of words. Evaluated against the sentence "Want to reset my password", keyword("reset") currently returns 0.2.
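This score is easy to reproduce by hand; a small shell illustration of the same computation:

```shell
# Score of keyword("reset") on "Want to reset my password":
# occurrences of the word divided by the total number of words.
query="Want to reset my password"
echo "$query" | awk '{c=0; for (i=1; i<=NF; i++) if ($i == "reset") c++; printf "%.1f\n", c/NF}'
# prints 0.2
```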

TODO: This is just a temporary score used while our NLP library Manaus is not yet integrated into the decision table.

These are currently the expressions you can use to evaluate the goodness of a query (see DefaultFactoryAtomic and StarchatFactoryAtomic):

Expressions: Operators

Operators evaluate the output of one or more expressions and return a value. Currently, the following operators are implemented (see the source code):

Technical corner: expressions

Atoms, like keyword in the example above, have the following methods/members:

  1. def evaluate(query: String): Double: produces a score. It may or may not be normalized to 1 (set val isEvaluateNormalized: Boolean accordingly)
  2. val match_threshold: the threshold above which the expression is considered true when matches is called. NB: the default value is 0.0, which is normally not ideal.
  3. def matches(query: String): Boolean: calls evaluate and checks the result against the threshold.
  4. val rx: the name of the atom, as it should be used in the analyzer field.

Configuration of the answer recommender (Knowledge Base)

Through the /knowledgebase endpoint you can add, edit, and remove question and answer pairs, used by StarChat to recommend possible answers when a question arrives.

Documents containing Q&A pairs must be structured as follows:

{
    "id": "0",  // id of the pair
    "conversation": "id:1000",   // id of the conversation. This can be useful to external services
    "index_in_conversation": 1,  // when the pair appears inside the conversation, as above
    "question": "thank you",  // The question to be matched
    "answer": "you are welcome!",  // The answer to be recommended 
    "question_scored_terms": [  // A list of keyword and score. You can use your own keyword extractor or our Manaus (see later)
        [
            "thank", 
            1.9
        ]
    ],
    "verified": true,  // A variable used in some call centers
    "topics": "t1 t2",  // Eventual topics to be associated
    "dclass": "", // Optional field as a searchable class for answer
    "doctype": "normal",
    "state": "",
    "status": 0
}

See POST /knowledgebase for an example with curl. The other methods (GET, DELETE, PUT) are used to retrieve, delete, or update a document.
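Note that the // comments above are explanatory and must be removed before sending the document. A sketch of validating a comment-free copy (kb_doc.json is a placeholder file name; python -mjson.tool is used elsewhere in this document):

```shell
# Strip the explanatory // comments and check the result is valid JSON.
sed 's|//.*$||' kb_doc.json | python -mjson.tool > /dev/null && echo "valid JSON"
```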

Test the knowledge base

Just run the example in POST /knowledgebase_search.

Manaus

In the Q&A pair above you saw the field question_scored_terms. Good keywords enormously improve the quality of the answers. You can of course enter them by hand, or run software that extracts keywords from the question. If you prefer the latter but don't have any, we provide Manaus.

Manaus is still under development, but it's already included in the Docker installation of StarChat. When you launch docker-compose up -d, you also launch a container with Manaus, which analyzes all the Q&A pairs in the Knowledge Base, produces keywords, and updates the question_scored_terms field for all documents. The process is repeated every 4 hours.

Manaus configuration

Have a look at the file docker-starchat/docker-compose.yml. For Manaus to perform well, you need to provide decent language statistics. Update the file /manaus/statistics_data/english/word_frequency.tsv with a word-frequency file in the following format:

1       you     1222421
2       I       1052546
3       to      823661
....

We have frequency files for more than 50 languages, but consider that the choice of a good "prior distribution" of word frequencies is crucial for any NLP task.
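If you need to build such a file from your own corpus, here is a minimal sketch (corpus.txt is a placeholder name; real statistics should come from a large, representative corpus):

```shell
# Build a rank<TAB>word<TAB>count TSV from a plain-text corpus.
tr -s '[:space:]' '\n' < corpus.txt | grep -v '^$' | sort | uniq -c | sort -rn \
  | awk '{printf "%d\t%s\t%d\n", NR, $2, $1}' > word_frequency.tsv
head -3 word_frequency.tsv
```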

Technology

StarChat was designed with the following goals in mind:

  1. easy deployment
  2. multi-tenancy: StarChat can handle different KnowledgeBase and DecisionTable configurations
  3. horizontal scalability without any service interruption
  4. modularity
  5. statelessness

How does StarChat work?

Workflow

(Workflow diagram)

Components

StarChat uses Elasticsearch as a NoSQL database and, as mentioned above, as an NLP preprocessor for indexing, sentence cleansing, and tokenization.

Services

StarChat consists of the following REST resources.

Root

The root endpoint provides a simple health check.

SystemIndexManagement

The SystemIndexManagement set of endpoints provides a means to set up and manage the system tables.

IndexManagement

The IndexManagement REST endpoints allow creating new indices for new instances (StarChat is multi-tenant).

LanguageGuesser

Offers endpoints to guess the language given a text.

QuestionAnswer type (same API syntax and semantics):

The following REST endpoints have the same syntax and semantics but serve different needs.

KnowledgeBase

For a quick setup based on real Q&A logs. It stores question and answer pairs. Given a text as input, it proposes the pair with the closest match on the question field. At the moment the KnowledgeBase supports only analyzers implemented on Elasticsearch.

PriorData

The prior data contains text used to extract statistics about terms. The data are used primarily for term extraction (Manaus).

ConversationLogs

Endpoint to collect and store the conversation logs.

Tokenizer

Endpoint which exposes text tokenization functionalities.

Spellcheck

Endpoint which exposes spellcheck functionalities; the term statistics are taken from the KnowledgeBase.

Term

Endpoint to store information about terms: synonyms, antonyms, and vector representations.

TermsExtraction

Exposes Manaus functionalities to extract significant terms from text. It needs data from the PriorData and a domain-specific dataset (KnowledgeBase).

AnalyzersPlayground

Exposes REST endpoints to test the analyzers on the fly.

DecisionTable

The conversational engine itself. For the usage, see below.

Configuration of the DecisionTable

You configure the DecisionTable through a CSV file. Please have a look at the CSV provided in the doc/ directory.

Fields in the configuration file are of three types:

And the fields are:

Client functions

In StarChat configuration, the developer can specify which function the front-end should execute when a certain state is triggered, together with input parameters. Any function implemented on the front-end can be called.

Example show button

Example "buttons": the front-end implements the function show_buttons and uses "action input" to call it. It will show two buttons, where the first returns forgot_password and the second account_locked.

Example send email

Example "send email": the front-end implements the function send_password_link and uses "action input" to call it. The variable %email% is automatically substituted by the variable email if available in the JSON passed to the StarchatResource.

Functions for the sample CSV

For the CSV in the example above, the client will have to implement the following set of functions:

Ref: sample_state_machine_specification.csv.

Mechanics

Scalability

StarChat consists of two different services: StarChat itself and an Elasticsearch cluster.

Scaling StarChat instances

Since StarChat is stateless, it can scale horizontally by replication. All instances can access all the indexes configured on Elasticsearch and can answer API calls, enforcing authentication and authorization rules. A load balancer is responsible for distributing requests to the instances transparently.

In the diagram below, a load balancer forwards requests coming from the front-end to StarChat instances, which access the indices created on the Elasticsearch cluster.

(Scalability diagram)

Scaling Elasticsearch

Similarly, Elasticsearch can easily scale horizontally adding new nodes to the cluster, as explained in Elasticsearch Documentation.

Security

StarChat is a backend service which supports authentication and authorization with salted SHA512 hashed password and differentiated permissions.

The admin hashed password and salt are stored in the StarChat configuration file; the user credentials (hashed password, salt, permissions) are instead saved on Elasticsearch (further backends for authentication/authorization can be implemented by specializing the Auth classes).

StarChat supports TLS connections; the configuration file allows choosing whether the service should expose an HTTPS connection, an HTTP connection, or both. In order to use the HTTPS connection, the user must do the following:

The block of the configuration file to be modified as described above in order to use HTTPS is:

https {
  host = "0.0.0.0"
  host = ${?HOST}
  port = 8443
  port = ${?PORT}
  certificate = "server.p12"
  password = "uma7KnKwvh"
  enable = false
}

http {
  host = "0.0.0.0"
  host = ${?HOST}
  port = 8888
  port = ${?PORT}
  enable = true
}

StarChat comes with a default self-signed certificate for testing. Using it in production or in any sensitive environment is highly discouraged, as well as useless from a security point of view.
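In practice, enabling HTTPS means pointing the https block at your own certificate and flipping the enable flags; a sketch (my-certificate.p12 and its password are placeholders for your own certificate placed in the config folder):

```
https {
  host = "0.0.0.0"
  port = 8443
  certificate = "my-certificate.p12"
  password = "my-certificate-p4ss"
  enable = true
}
```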

Indexing terms on the term table

The following program indexes term vectors into the term table:

sbt "run-main com.getjenny.command.IndexTerms --inputfile terms.txt --vecsize 300"

The format for each row of an input file with 5-dimension vectors is: hello 1.0 2.0 3.0 4.0 0.0
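Before indexing, it can be worth sanity-checking the input file; a sketch, assuming every row must contain the term followed by exactly --vecsize components:

```shell
# Verify each row of terms.txt has 1 term + VECSIZE numeric fields.
VECSIZE=5   # matches the 5-dimension example row above
awk -v n="$VECSIZE" 'NF != n + 1 { print "bad row at line " NR; bad = 1 }
                     END { if (!bad) print "all rows ok" }' terms.txt
```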

You can use your own ad-hoc trained vector model (as we do); otherwise, you can use the Google word2vec models trained on Google News. You can find a copy of the Elasticsearch index with pre-loaded Google News terms.

Test

Unit tests

A set of unit tests is available, using docker-compose to set up a backend; the command to run the tests is:

sbt dockerComposeUp ; sbt test ; sbt dockerComposeStop

test scripts with sample API calls

Troubleshooting

Docker: start from scratch

You might want to start from scratch and delete all Docker images.

If you do so (docker images and then docker rmi -f <java/elasticsearch ids>), remember that all data for the Elasticsearch container are stored locally and mounted only when the container is up. Therefore you need to:

cd docker-starchat
rm -rf elasticsearch/data/nodes/

docker-compose: Analyzers are not loaded

StarChat is started immediately after Elasticsearch, and it is possible that Elasticsearch is not yet ready to respond to REST calls from StarChat (e.g. an index not found error could be raised in this case).

Sample error on the logs:

2017-06-15 10:37:22,993 >10:37:22.992UTC ERROR c.g.s.s.AnalyzerService(akka://starchat-service) com.getjenny.starchat.services.AnalyzerService(akka://starchat-service) - can't load analyzers: [jenny-en-0]
 IndexNotFoundException[no such index]

In order to avoid this problem you can call the services one by one:

docker-compose up elasticsearch # wait until elasticsearch is up and running
docker-compose up starchat      # starchat will retrieve the analyzers from elasticsearch

Alternatively, it is possible to call the command to load/refresh the analyzers after the docker-compose command:

curl -v -H "Content-Type: application/json" -X POST "http://localhost:8888/decisiontable_analyzer"

Docker: Size of virtual memory

If Elasticsearch complains about the size of the virtual memory:

max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]
elastisearch exited with code 78

run:

sysctl -w vm.max_map_count=262144