Welcome!

This is the official repository for StarChat, an open-source, scalable conversational engine for B2B applications. You can find the code in our GitHub repository.

How to contribute

To contribute to StarChat, please send us a pull request from your fork of this repository.

Our concise contribution guideline contains the bare minimum requirements for code contributions.

Before contributing (or opening issues), you might want to send us an email at starchat@getjenny.com.

Quick Start

Requirements

The easiest way is to install StarChat using two docker images. You only need:

In this way, you will put all the indices in the Elasticsearch (version 6.1) image, and StarChat itself in the Java (8) image.

If you do not use Docker, you therefore need on your machine:

  1. Scala 2.12
  2. Elasticsearch 6.1 (the same version used by the Docker image above)

1. Launch Docker containers

We provide all the containers needed to get StarChat up and running without compiling locally.

To use them, you need to download Starchat Docker or simply type:

git clone git@github.com:GetJenny/starchat-docker.git

Now just type:

docker-compose up -d

And you will have a running instance on port 8888. (If you want to change ports, go to starchat-docker/starchat/config/application.conf)

(Problems like elastisearch exited with code 78? have a look at troubleshooting!)

Change Manaus Language

In the file docker-compose.yml change:

    command: ["/manaus/scripts/wait-for-it.sh", 
    "getjenny-es", "9200", "5", 
    "/manaus/bin/continuous-keywords-update", "--temp_data_folder", 
    "/manaus/data", 
    "--host_map", "getjenny-es=9300", "--interval_sec", "14400", 
    "--word_frequencies", 
    "/manaus/statistics_data/english/word_frequency.tsv", 
    "--cluster_name", "starchat", "--index_name", "jenny-en-0"]

to

    command: ["/manaus/scripts/wait-for-it.sh", 
    "getjenny-es", "9200", "5", 
    "/manaus/bin/continuous-keywords-update", "--temp_data_folder", 
    "/manaus/data", 
    "--host_map", "getjenny-es=9300", "--interval_sec", "14400", 
    "--word_frequencies", "/manaus/statistics_data/***italian***/word_frequency.tsv", 
    "--cluster_name", "starchat", "--index_name",  "***jenny-it-0***"]

2. Create Elasticsearch indices

Run from a terminal:

# create the system indices in Elasticsearch
PORT=${1:-8888}
curl -v -H "Authorization: Basic `echo -n 'admin:adminp4ssw0rd' | base64`" \
  -H "Content-Type: application/json" -X POST "http://localhost:${PORT}/system_index_management/create"

If you are using a language other than English, replace english in the name of the index:

# create the application indices on Elasticsearch
PORT=${1:-8888}
INDEX_NAME=${2:-index_getjenny_english_0}
curl -v -H "Authorization: Basic `echo -n 'admin:adminp4ssw0rd' | base64`" \
  -H "Content-Type: application/json" -X POST "http://localhost:${PORT}/${INDEX_NAME}/index_management/create"
# add a user associated with the application index index_getjenny_english_0 created above
PORT=${1:-8888}
curl -v -H "Authorization: Basic `echo -n 'admin:adminp4ssw0rd' | base64`" \
  -H "Content-Type: application/json" -X POST http://localhost:${PORT}/user -d '{
        "id": "test_user",
        "password": "3c98bf19cb962ac4cd0227142b3495ab1be46534061919f792254b80c0f3e566f7819cae73bdc616af0ff555f7460ac96d88d56338d659ebd93e2be858ce1cf9", 
        "salt": "salt",
        "permissions": {
                "index_getjenny_english_0": ["read", "write"]
        }
}'
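The password field above is a salted SHA-512 hash (see the Security section below). A sketch of generating such a hash from the command line, assuming the scheme is SHA-512 over the password concatenated with the salt (verify against your StarChat version before relying on it):

```shell
# Hypothetical: derive a hashed password for user creation.
# Assumes hash = SHA-512(password + salt); check your StarChat version.
PASSWORD='p4ssw0rd'
SALT='salt'
hash=$(echo -n "${PASSWORD}${SALT}" | sha512sum | awk '{print $1}')
echo "$hash"
```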

3. Load the configuration file

Now you have to load the configuration file for the actual chat, also known as the decision table. We provide an example configuration file in English, therefore:

cd $STARCHAT  # so we have doc/decision_table_starchat_doc.csv
PORT=${1:-8888}
INDEX_NAME=${2:-'index_getjenny_english_0'}
FILENAME=${3:-"`readlink -e doc/decision_table_starchat_doc.csv`"}
curl -v -H "Authorization: Basic `echo -n 'test_user:p4ssw0rd' | base64`" \
  --form "csv=@${FILENAME}" "http://localhost:${PORT}/${INDEX_NAME}/decisiontable_upload_csv"

and then you need to index the analyzer:

PORT=${1:-8888}
INDEX_NAME=${2:-index_getjenny_english_0}
curl -v -H "Authorization: Basic `echo -n 'test_user:p4ssw0rd' | base64`" \
  -H "Content-Type: application/json" -X POST "http://localhost:${PORT}/${INDEX_NAME}/decisiontable_analyzer"

If you need to start over, this endpoint deletes all previously loaded states:

PORT=${1:-8888}
INDEX_NAME=${2:-index_getjenny_english_0}
curl -v -H "Authorization: Basic `echo -n 'test_user:p4ssw0rd' | base64`" \
  -H "Content-Type: application/json" -X DELETE http://localhost:${PORT}/${INDEX_NAME}/decisiontable

4. Load external corpus (optional)

To obtain good word statistics, and consequently better matching, you might want to index a corpus which is hidden from results. For instance, you can index various sentences as hidden using the POST /knowledgebase endpoint with doctype: "hidden".

5. Index the FAQs (optional)

TODO: You might want to activate the knowledge base for simple Question and Answer.

Install without Docker

Note: we do not support this installation.

The service binds to port 8888 by default.

Install local Docker (for testing branches)

Generate a package distribution. In the StarChat directory:

sbt dist

Enter the directory docker-starchat:

cd  docker-starchat

You will get a message like Your package is ready in ...../target/universal/starchat-4ee.... .zip. Extract the package into the docker-starchat folder:

unzip ../target/universal/starchat-4eee.....zip
ln -s starchat-4ee..../  starchat

The zip package contains:

Review the configuration file starchat/config/application.conf and configure the language if needed (by default you have index_language = "english").

(If you are re-installing StarChat and want to start from scratch, see start from scratch.)

Start both StarChat and Elasticsearch:

docker-compose up -d

(Problems like elastisearch exited with code 78? have a look at troubleshooting!)

Test the installation

Is the service working? First of all: did you load a configuration file? If yes, try:

curl -X GET localhost:8888 | python -mjson.tool

Get the test_state

PORT=${1:-8888}
INDEX_NAME=${2:-index_getjenny_english_0}
curl -v -H "Authorization: Basic `echo -n 'test_user:p4ssw0rd' | base64`" \
 -H "Content-Type: application/json" -X POST http://localhost:${PORT}/${INDEX_NAME}/get_next_response -d '{
 "conversation_id": "1234",
  "user_input": { "text": "install starchat" },
  "values": {
      "return_value": "",
      "data": {}
       }
  }'

You should get:

[
   {
      "state" : "install",
      "data" : {},
      "action" : "",
      "success_value" : "",
      "state_data" : {},
      "score" : 1,
      "conversation_id" : "1234",
      "traversed_states" : [
         "install"
      ],
      "max_state_count" : 0,
      "action_input" : {},
      "analyzer" : "band(bor(keyword(\"setup\"), keyword(\"install.*\")), bnot(bor(keyword(\"standalone\"), keyword(\"docker\"))))",
      "bubble" : "Just choose one of the two:\n<ul>\n<li>docker install (recommended)</li>\n<li>standalone install</li>\n</ul>",
      "failure_value" : ""
   }
]

If you look at the "analyzer" field, you'll see that this state is triggered when the user types "setup" or a word starting with "install", as long as neither "standalone" nor "docker" appears. Try with "text": "Please don't send me the test state" and StarChat will send an empty message.

Configuration of the chatbot (Decision Table)

With StarChat's Decision Table you can easily implement workflow-based chatbots. After the installation (see above) you only have to configure a conversation flow and, if needed, a front-end client.

NLP processing

NLP processing is of course the core of any chatbot. As you may have noted in the CSV provided in the doc/ directory, there are two fields that define when StarChat should trigger a state: analyzer and queries.

Queries

If the analyzer field is empty, StarChat queries Elasticsearch for the state containing the sentence most similar to the user input in the queries field. We have carefully configured Elasticsearch to provide good answers (e.g. boosting results where the same words appear, etc.), and the results are acceptable. But you are encouraged to use the analyzer field, documented below.

Analyzer

The analyzers are written in a Domain Specific Language that allows you to combine various functions using logical operators.

Through the analyzers, you can leverage various NLP algorithms included in StarChat and combine their results with other rules. For instance, you might want to enter the state that asks for the email address only if the variable "email" is not set. Or you might want to escalate to a human operator after detecting swearing three times, or escalate only on working days. You can do all of that with the analyzers.

Let's look at the simple example included in the CSV provided in the doc/ directory for the state forgot_password:

booleanAnd(keyword("password"),booleanOr(keyword("reset"),keyword("forgot")))

The analyzer above says the state must be triggered if the term "password" is detected together with either "reset" or "forgot".
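As an illustration only, the same boolean combination can be mimicked in plain shell (this is a toy sketch, not StarChat code):

```shell
# Toy re-creation of booleanAnd(keyword("password"),
# booleanOr(keyword("reset"), keyword("forgot"))) for a single query.
query="I forgot my password"
has() { case " $query " in *" $1 "*) return 0 ;; *) return 1 ;; esac; }
if has password && { has reset || has forgot; }; then
  echo "state triggered"
else
  echo "no match"
fi
```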

Another example. Imagine we have a state called send-updates. In this state StarChat proposes the question "Where can we send you updates?". In the config file:

state           | ... | bubble                              | ...
send-updates    | ... | "Where can we send you updates?"    |

In another state, called send-email, the analyzer field contains:

booleanAnd(lastTravStateIs("send-updates"), matchEmailAddress("verification_"))

This means send-email will be triggered only after send-updates (because of lastTravStateIs) and only if an email address is detected (because of matchEmailAddress). This also sets the variable verification_email, because on success matchEmailAddress always sets such a variable, using the expression's argument as prefix.

In addition, an expression like sendVerificationEmail could be developed (we haven't) which accepts more arguments, for example:

sendVerificationEmail("verification_", "Email about Verification", "Here is your verification link %__temp__verification_link%")

In this case, the expression would send the email with the given subject and body, setting the relevant variables on success.

TODO: It is fundamental here to build a set of metadata which allows any other component to receive all the needed information about an analyzer. For instance, sendVerificationEmail could have something like:

{
    "documentation": "Send an email with subject and body. If successful returns 1.0 and sets the variables. If not returns 0 and does not set the variables.",
    "argument_list": ["'prefix' of the variable 'prefix_email'",
                      "Subject of the email to be sent", "Body of the mail"],
    "created_variables": {  // Variables it creates
        "__temp__verification_link": "Link provided by the brand's API",  // deleted after usage because it starts with __temp__
        "prefix_email": "email address"
    },
    "used_variables": {}  // Variables it expects to find in the state (none here)
}

How the analyzers are organized

The analyzer DSL's building blocks are the expressions. For instance, or, and, and keyword are all expressions.

Expressions are then divided into operators (or, ...) and atoms (keyword).

Expressions: Atoms

Presently, keyword("reset") in the example provides a very simple score: the number of occurrences of the word reset in the user's query divided by the total number of words. Evaluated against the sentence "Want to reset my password", keyword("reset") currently returns 0.2.
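This score is easy to reproduce by hand; a small shell illustration of the same computation:

```shell
# Score of keyword("reset") on "Want to reset my password":
# occurrences of the word divided by the total number of words.
query="Want to reset my password"
echo "$query" | awk '{c=0; for (i=1; i<=NF; i++) if ($i == "reset") c++; printf "%.1f\n", c/NF}'
# prints 0.2
```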

TODO: This is just a temporary score used while our NLP library Manaus is not yet integrated into the decision table.

These are currently the expressions you can use to evaluate the goodness of a query (see DefaultFactoryAtomic and StarchatFactoryAtomic):

Expressions: Operators

Operators evaluate the output of one or more expressions and return a value. Currently, the following operators are implemented (see the source code):

Technical corner: expressions

Atoms, like keyword in the example above, have the following methods/members:

  1. def evaluate(query: String): Double: produces a score. It may or may not be normalized to 1 (set val isEvaluateNormalized: Boolean accordingly)
  2. val match_threshold: the threshold above which the expression is considered true when matches is called. NB: the default value is 0.0, which is normally not ideal.
  3. def matches(query: String): Boolean: calls evaluate and checks the result against the threshold.
  4. val rx: the name of the atom, as it should be used in the analyzer field.

Configuration of the answer recommender (Knowledge Base)

Through the /knowledgebase endpoint you can add, edit, and remove question and answer pairs, used by StarChat to recommend possible answers when a question arrives.

Documents containing Q&A pairs must be structured as follows:

{
    "id": "0",  // id of the pair
    "conversation": "id:1000",   // id of the conversation. This can be useful to external services
    "index_in_conversation": 1,  // when the pair appears inside the conversation, as above
    "question": "thank you",  // The question to be matched
    "answer": "you are welcome!",  // The answer to be recommended 
    "question_scored_terms": [  // A list of keyword and score. You can use your own keyword extractor or our Manaus (see later)
        [
            "thank", 
            1.9
        ]
    ],
    "verified": true,  // A variable used in some call centers
    "topics": "t1 t2",  // Eventual topics to be associated
    "dclass": "", // Optional field as a searchable class for answer
    "doctype": "normal",
    "state": "",
    "status": 0
}

See POST /knowledgebase for an example with curl. The other methods (GET, DELETE, PUT) are used to retrieve, delete, or update a document.
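Note that the // comments above are explanatory and must be removed before sending the document. A sketch of validating a comment-free copy (kb_doc.json is a placeholder file name; python -mjson.tool is used elsewhere in this document):

```shell
# Strip the explanatory // comments and check the result is valid JSON.
sed 's|//.*$||' kb_doc.json | python -mjson.tool > /dev/null && echo "valid JSON"
```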

Test the knowledge base

Just run the example in POST /knowledgebase_search.

Manaus

In the Q&A pair above you saw the field question_scored_terms. Good keywords enormously improve the quality of the answers. You can of course enter them by hand, or run software that extracts keywords from the question. If you prefer the latter but don't have any, we provide Manaus.

Manaus is still under development, but it's already included in the Docker installation of StarChat. When you launch docker-compose up -d, you also launch a container with Manaus, which analyzes all the Q&A pairs in the Knowledge Base, produces keywords, and updates the question_scored_terms field for all documents. The process is repeated every 4 hours.

Manaus configuration

Have a look at the file docker-starchat/docker-compose.yml. For Manaus to perform well, you need to provide decent language statistics. Update the file /manaus/statistics_data/english/word_frequency.tsv with a word-frequency file in the following format:

1       you     1222421
2       I       1052546
3       to      823661
....

We have frequency files for more than 50 languages, but consider that the choice of a good "prior distribution" of word frequencies is crucial for any NLP task.
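If you need to build such a file from your own corpus, here is a minimal sketch (corpus.txt is a placeholder name; real statistics should come from a large, representative corpus):

```shell
# Build a rank<TAB>word<TAB>count TSV from a plain-text corpus.
tr -s '[:space:]' '\n' < corpus.txt | grep -v '^$' | sort | uniq -c | sort -rn \
  | awk '{printf "%d\t%s\t%d\n", NR, $2, $1}' > word_frequency.tsv
head -3 word_frequency.tsv
```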

Technology

StarChat was designed with the following goals in mind:

  1. easy deployment
  2. multi-tenancy: StarChat can handle different KnowledgeBase and DecisionTable configurations
  3. horizontal scalability without any service interruption
  4. modularity
  5. statelessness

How does StarChat work?

Workflow

(Workflow diagram)

Components

StarChat uses Elasticsearch as a NoSQL database and, as mentioned above, as an NLP preprocessor for indexing, sentence cleansing, and tokenization.

Services

StarChat consists of the following REST resources.

Root

The root endpoint provides a simple health check.

SystemIndexManagement

The SystemIndexManagement set of endpoints provides a means to set up and manage the system tables.

IndexManagement

The IndexManagement REST endpoints allow creating new indices for new instances (StarChat is multi-tenant).

LanguageGuesser

Offers endpoints to guess the language given a text.

QuestionAnswer type (same API syntax and semantics):

The following REST endpoints have the same syntax and semantics but serve different needs.

KnowledgeBase

For a quick setup based on real Q&A logs. It stores question and answer pairs. Given a text as input, it proposes the pair with the closest match on the question field. At the moment the KnowledgeBase supports only analyzers implemented on Elasticsearch.

PriorData

The prior data contains text used to extract statistics about terms. The data are used primarily for term extraction (Manaus).

ConversationLogs

Endpoint to collect and store the conversation logs.

Tokenizer

Endpoint which exposes text tokenization functionalities.

Spellcheck

Endpoint which exposes spellcheck functionalities; the term statistics are taken from the KnowledgeBase.

Term

Endpoint to store information about terms: synonyms, antonyms, and vector representations.

TermsExtraction

Exposes Manaus functionalities to extract significant terms from text. It needs data from the PriorData and a domain-specific dataset (KnowledgeBase).

AnalyzersPlayground

Exposes REST endpoints to test the analyzers on the fly.

DecisionTable

The conversational engine itself. For the usage, see below.

Configuration of the DecisionTable

You configure the DecisionTable through a CSV file. Please have a look at the CSV provided in the doc/ directory.

Fields in the configuration file are of three types:

And the fields are:

Client functions

In StarChat configuration, the developer can specify which function the front-end should execute when a certain state is triggered, together with input parameters. Any function implemented on the front-end can be called.

Example show button

Example "buttons": the front-end implements the function show_buttons and uses "action input" to call it. It will show two buttons, where the first returns forgot_password and the second account_locked.

Example send email

Example "send email": the front-end implements the function send_password_link and uses "action input" to call it. The variable %email% is automatically substituted by the variable email if available in the JSON passed to the StarchatResource.

Functions for the sample CSV

For the CSV in the example above, the client will have to implement the following set of functions:

Ref: sample_state_machine_specification.csv.

Mechanics

Scalability

StarChat consists of two different services: StarChat itself and an Elasticsearch cluster.

Scaling StarChat instances

Since StarChat is stateless, it can scale horizontally by replication. All instances can access all the indexes configured on Elasticsearch and can answer API calls, enforcing authentication and authorization rules. A load balancer is responsible for distributing requests to the instances transparently.

In the diagram below, a load balancer forwards requests coming from the front-end to StarChat instances, which access the indices created on the Elasticsearch cluster.

(Scalability diagram)

Scaling Elasticsearch

Similarly, Elasticsearch can easily scale horizontally adding new nodes to the cluster, as explained in Elasticsearch Documentation.

Security

StarChat is a backend service which supports authentication and authorization with salted SHA512 hashed password and differentiated permissions.

The admin hashed password and salt are stored in the StarChat configuration file; the user credentials (hashed password, salt, permissions) are instead saved on Elasticsearch (further backends for authentication/authorization can be implemented by specializing the Auth classes).

StarChat supports TLS connections; the configuration file allows choosing whether the service should expose an HTTPS connection, an HTTP connection, or both. In order to use the HTTPS connection, the user must do the following:

The block of the configuration file to be modified as described above in order to use HTTPS is:

https {
  host = "0.0.0.0"
  host = ${?HOST}
  port = 8443
  port = ${?PORT}
  certificate = "server.p12"
  password = "uma7KnKwvh"
  enable = false
}

http {
  host = "0.0.0.0"
  host = ${?HOST}
  port = 8888
  port = ${?PORT}
  enable = true
}

StarChat comes with a default self-signed certificate for testing. Using it in production or in any sensitive environment is highly discouraged, as well as useless from a security point of view.
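In practice, enabling HTTPS means pointing the https block at your own certificate and flipping the enable flags; a sketch (my-certificate.p12 and its password are placeholders for your own certificate placed in the config folder):

```
https {
  host = "0.0.0.0"
  port = 8443
  certificate = "my-certificate.p12"
  password = "my-certificate-p4ss"
  enable = true
}
```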

Indexing terms on the term table

The following program indexes term vectors into the term table:

sbt "run-main com.getjenny.command.IndexTerms --inputfile terms.txt --vecsize 300"

The format for each row of an input file with 5-dimension vectors is: hello 1.0 2.0 3.0 4.0 0.0
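Before indexing, it can be worth sanity-checking the input file; a sketch, assuming every row must contain the term followed by exactly --vecsize components:

```shell
# Verify each row of terms.txt has 1 term + VECSIZE numeric fields.
VECSIZE=5   # matches the 5-dimension example row above
awk -v n="$VECSIZE" 'NF != n + 1 { print "bad row at line " NR; bad = 1 }
                     END { if (!bad) print "all rows ok" }' terms.txt
```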

You can use your own ad-hoc trained vector model (as we do); otherwise, you can use the Google word2vec models trained on Google News. You can find a copy of the Elasticsearch index with pre-loaded Google News terms.

Test

Unit tests

A set of unit tests is available, using docker-compose to set up a backend; the command to run the tests is:

sbt dockerComposeUp ; sbt test ; sbt dockerComposeStop

test scripts with sample API calls

Troubleshooting

Docker: start from scratch

You might want to start from scratch and delete all Docker images.

If you do so (docker images and then docker rmi -f <java/elasticsearch ids>), remember that all data for the Elasticsearch container are stored locally and mounted only when the container is up. Therefore you need to:

cd docker-starchat
rm -rf elasticsearch/data/nodes/

docker-compose: Analyzers are not loaded

StarChat is started immediately after Elasticsearch, and it is possible that Elasticsearch is not yet ready to respond to REST calls from StarChat (e.g. an index not found error could be raised in this case).

Sample error on the logs:

2017-06-15 10:37:22,993 >10:37:22.992UTC ERROR c.g.s.s.AnalyzerService(akka://starchat-service) com.getjenny.starchat.services.AnalyzerService(akka://starchat-service) - can't load analyzers: [jenny-en-0]
 IndexNotFoundException[no such index]

In order to avoid this problem you can call the services one by one:

docker-compose up elasticsearch # wait until elasticsearch is up and running
docker-compose up starchat      # starchat will retrieve the analyzers from elasticsearch

Alternatively, it is possible to call the command to load/refresh the analyzers after the docker-compose command:

curl -v -H "Content-Type: application/json" -X POST "http://localhost:8888/decisiontable_analyzer"

Docker: Size of virtual memory

If Elasticsearch complains about the size of the virtual memory:

max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]
elastisearch exited with code 78

run:

sysctl -w vm.max_map_count=262144