[PYTHON] I tried Watson Speech to Text

What I want to do

Try Watson's Speech to Text by running the sample from the article below. (https://www.ibm.com/blogs/watson/2016/07/getting-robots-listen-using-watsons-speech-text-service/)


I want to try Watson's speech recognition (Speech to Text) in order to build a Raspberry Pi robot that converts the audio of a video into text in real time.

As shown in the figure below, the final goal is speech recognition and transcription with Raspberry Pi 3 x Julius x Watson (Speech to Text). (http://qiita.com/nanako_ut/items/1e044eb494623a3961a5)

This time, I look into how to call Watson's speech recognition, which is part (4) of the figure below. [Figure: img20170324_14192489.jpg]



The following is assumed to be ready:
- A Watson user account (it seems that all services can be used free of charge for one month after registration)
- A Speech to Text service created in Watson, with its credentials obtained


  1. Connect with curl (upload audio file)
  2. Connection with python Part 1 (audio file upload)
  3. Connection with python Part 2 (Real-time voice analysis with WebSocket connection)

■ Connect with curl (upload audio file)

1.1 Upload audio file

Specify the audio file (test.wav) and upload it to Watson over an HTTP connection.

curl -X POST -u username:password --header "Content-Type: audio/wav" --header "Transfer-Encoding: chunked" --data-binary @test.wav "https://stream.watsonplatform.net/speech-to-text/api/v1/recognize?model=ja-JP_BroadbandModel"
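For reference, the curl options above map onto an ordinary HTTP request. The sketch below only builds the URL and headers and sends nothing. It assumes that `-u` translates to a standard HTTP Basic `Authorization` header; `basic_auth_header` and `build_recognize_request` are hypothetical helper names, not part of any Watson library.

```python
import base64

try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2

BASE_URL = "https://stream.watsonplatform.net/speech-to-text/api/v1/recognize"

def basic_auth_header(username, password):
    """Return the value of an HTTP Basic Authorization header."""
    token = ("%s:%s" % (username, password)).encode("utf-8")
    return "Basic " + base64.b64encode(token).decode("ascii")

def build_recognize_request(username, password, model):
    """Build (url, headers) mirroring the curl command; hypothetical helper."""
    url = BASE_URL + "?" + urlencode({"model": model})
    headers = {
        "Authorization": basic_auth_header(username, password),
        "Content-Type": "audio/wav",
        "Transfer-Encoding": "chunked",
    }
    return url, headers

# Placeholder credentials, as in the curl command.
url, headers = build_recognize_request("username", "password", "ja-JP_BroadbandModel")
print(url)
```

The `model` query parameter is what selects the Japanese broadband model in the curl command.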

1.2 Execution result

Something came back, but the characters are garbled. The recognition result is Japanese, so perhaps the UTF-8 output from the Raspberry Pi is being displayed in a different encoding (Shift_JIS?). [Screenshot: 20170331.PNG]
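One possible explanation, offered as an assumption rather than a diagnosis: the response is UTF-8 encoded JSON, and whatever displayed it decoded the bytes with a different character set. The sketch below reproduces that effect with a made-up sample string, not actual Watson output.

```python
# -*- coding: utf-8 -*-
import json

# Made-up stand-in for a Japanese transcript; not real service output.
sample = {"transcript": u"こんにちは"}

# JSON travels over HTTP as bytes, here encoded as UTF-8.
raw_bytes = json.dumps(sample, ensure_ascii=False).encode("utf-8")

# Decoding with the wrong codec yields garbled text (mojibake).
garbled = raw_bytes.decode("shift_jis", errors="replace")

# Decoding with the codec the sender used restores the text.
restored = json.loads(raw_bytes.decode("utf-8"))

print(garbled != raw_bytes.decode("utf-8"))
print(restored["transcript"] == sample["transcript"])
```

If this is the cause, changing the character set of the terminal that displays the output, rather than anything on the Watson side, would be the first thing to try.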

■ Connection with python Part 1 (audio file upload)

Implemented with reference to the sample source in "Getting robots to listen: Using Watson's Speech to Text service".

2.1 Environment setup

Install watson-developer-cloud 0.23.0, the Python library for Watson.

pip installation

Not required if pip is already installed. It was not present on my Raspberry Pi, probably because I installed RASPBIAN JESSIE LITE on the Raspberry Pi 3.

$ python -m pip -V
/usr/bin/python: No module named pip

$ sudo apt-get install python-pip
Reading package lists... Done
Building dependency tree
~ (output omitted) ~

$ python -m pip -V
pip 1.5.6 from /usr/lib/python2.7/dist-packages (python 2.7)


$ sudo pip install -U pip
  Downloading pip-9.0.1-py2.py3-none-any.whl (1.3MB): 1.3MB downloaded
Installing collected packages: pip
  Found existing installation: pip 1.5.6
    Not uninstalling pip at /usr/lib/python2.7/dist-packages, owned by OS
Successfully installed pip
Cleaning up...

$ python -m pip -V
pip 9.0.1 from /usr/local/lib/python2.7/dist-packages (python 2.7)

watson-developer-cloud installation

$ sudo pip install --upgrade watson-developer-cloud
Collecting watson-developer-cloud
  Downloading watson-developer-cloud-0.23.0.tar.gz (52kB)
~ (output omitted) ~
Successfully installed pysolr-3.6.0 requests-2.12.5 watson-developer-cloud-0.23.0

2.2 Execution program

Copied from the referenced site.


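from watson_developer_cloud import SpeechToTextV1
import json

# Credentials issued for the Speech to Text service
stt = SpeechToTextV1(username="username", password="password")

# Send the WAV file and pretty-print the JSON response (Python 2 syntax)
with open("test1.wav", "rb") as audio_file:
    result = stt.recognize(audio_file, content_type="audio/wav")
print json.dumps(result, indent=2)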

2.3 Execution

Something came back, and it looks like the transcribed text. However, the recording was longer than this, and the text is cut off partway through.

{
  "results": [
    {
      "alternatives": [
        {
          "confidence": 0.438,
          "transcript": "so we know it's coming Julio just say yeah lost me grow mandatory right here shone like a great kid fifth grader etan Allemand planning his fifth critics "
        }
      ],
      "final": true
    }
  ],
  "result_index": 0
}

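To use the result in a program rather than just print it, the transcript has to be pulled out of the nested structure. The following is a minimal sketch that assumes the response always has the shape shown above and that the first entry in `alternatives` is the preferred one; `extract_transcripts` is a hypothetical helper name.

```python
def extract_transcripts(response):
    """Return (transcript, confidence) for the first alternative of each result."""
    pairs = []
    for result in response.get("results", []):
        alternatives = result.get("alternatives", [])
        if not alternatives:
            continue
        best = alternatives[0]
        pairs.append((best["transcript"].strip(), best.get("confidence")))
    return pairs

# Shape copied from the response above, with the transcript shortened.
response = {
    "results": [
        {
            "alternatives": [
                {"confidence": 0.438, "transcript": "so we know it's coming "}
            ],
            "final": True,
        }
    ],
    "result_index": 0,
}

for text, confidence in extract_transcripts(response):
    print("%.3f %s" % (confidence, text))
```
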
■ Connection with python Part 2 (Real-time voice analysis with WebSocket connection)

Apparently, audio can be recognized in real time by using something called WebSocket.

3.1 What is WebSocket?

According to https://www.html5rocks.com/ja/tutorials/websockets/basics/:

The WebSocket specification defines an API that establishes a "socket" connection between a web browser and a server. Simply put, there is a persistent connection between the client and the server, and either side can start sending data at any time.

According to http://www.atmarkit.co.jp/ait/articles/1111/11/news135.html, WebSocket is a new communication standard added with HTML5. Its features:

- Once a connection is established between the server and the client, data can be exchanged over the socket without regard to the communication procedure until the connection is explicitly closed.
- The server and all clients connected to it over WebSocket can share the same data and send and receive it in real time.
- With conventional communication techniques, an HTTP header is added on every exchange, so each connection generates a small amount of extra traffic and consumes resources on top of the data itself.
- With WebSocket, the client sends a handshake request on the first connection, the server returns a handshake response, and that single connection is then kept in use.

I see.
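The handshake mentioned above is an ordinary HTTP upgrade request: the client sends a random `Sec-WebSocket-Key`, and the server shows that it understood by returning a `Sec-WebSocket-Accept` value derived from that key. Libraries such as ws4py handle this internally. The sketch below only illustrates the derivation described in RFC 6455; `accept_key` is a hypothetical helper name.

```python
import base64
import hashlib

# Fixed GUID defined by RFC 6455 for the opening handshake.
WEBSOCKET_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"

def accept_key(sec_websocket_key):
    """Compute the Sec-WebSocket-Accept value for a client's Sec-WebSocket-Key."""
    material = (sec_websocket_key + WEBSOCKET_GUID).encode("ascii")
    digest = hashlib.sha1(material).digest()
    return base64.b64encode(digest).decode("ascii")

# Example key taken from RFC 6455.
print(accept_key("dGhlIHNhbXBsZSBub25jZQ=="))
```

After this one exchange, both sides keep the same TCP connection open and send frames without repeating HTTP headers.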

3.2 Environment setup

Install the ws4py library for WebSocket.

$ sudo pip install ws4py
Collecting ws4py
  Downloading ws4py-0.3.5-py2-none-any.whl (40kB)
    100% |████████████████████████████████| 40kB 661kB/s
Installing collected packages: ws4py
Successfully installed ws4py-0.3.5

3.2 Execution program

Copied from the referenced site.


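from ws4py.client.threadedclient import WebSocketClient
import base64, json

class SpeechToTextClient(WebSocketClient):
    def __init__(self):
        ws_url = "wss://stream.watsonplatform.net/speech-to-text/api/v1/recognize"

        # Build the HTTP Basic auth value from the service credentials
        username = "username"
        password = "password"
        auth_string = "%s:%s" % (username, password)
        base64string = base64.encodestring(auth_string).replace("\n", "")

        try:
            WebSocketClient.__init__(self, ws_url,
                headers=[("Authorization", "Basic %s" % base64string)])
            self.connect()
        except Exception:
            print "Failed to open WebSocket."

    def opened(self):
        # Announce the audio format: 16-bit linear PCM at 16 kHz
        self.send('{"action": "start", "content-type": "audio/l16;rate=16000"}')

    def received_message(self, message):
        # Each message from the service is JSON text
        print "Message received: " + str(json.loads(str(message)))

# Note: the part that streams microphone audio is not shown in this listing
stt_client = SpeechToTextClient()
raw_input()  # keep the main thread alive until Enter is pressed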

3.3 Execution

The recognition result for the audio is returned.

$ python watson_test2.py
Message received: {u'state': u'listening'}
sleep audio
Recording raw data 'stdin' : Signed 16 bit Little Endian, Rate 16000 Hz, Mono
Message received: {u'results': [{u'alternatives': [{u'confidence': 0.713, u'transcript': u'over the entire course of the scalp was it was all the guys that one rings before imagine '}], u'final': True}], u'result_index': 0}
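The `arecord` line in the output describes the same format that the start message declares with `audio/l16;rate=16000`: signed 16-bit little-endian samples, 16,000 per second, one channel. As a rough sketch of what such a payload looks like, the code below synthesizes one second of a test tone in that format and splits it into chunks. The 440 Hz tone, the 1024-byte chunk size, and the helper names are arbitrary choices for illustration.

```python
import math
import struct

SAMPLE_RATE = 16000   # samples per second, as in "rate=16000"
BYTES_PER_SAMPLE = 2  # signed 16-bit

def sine_pcm(seconds, frequency=440.0, amplitude=0.3):
    """Return raw mono 16-bit little-endian PCM for a sine tone."""
    count = int(seconds * SAMPLE_RATE)
    samples = [
        int(amplitude * 32767 * math.sin(2 * math.pi * frequency * i / SAMPLE_RATE))
        for i in range(count)
    ]
    return struct.pack("<%dh" % count, *samples)

def chunks(data, size=1024):
    """Yield successive fixed-size slices of data (the last may be shorter)."""
    for start in range(0, len(data), size):
        yield data[start:start + size]

pcm = sine_pcm(1.0)
pieces = list(chunks(pcm))
print(len(pcm), len(pieces))
```

At this rate one second of audio is 32,000 bytes, so a 1024-byte chunk holds 32 ms of sound.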

3.4 Challenges

Hmm. Although this is supposed to be real-time, no matter how much audio I send, only the first result comes back. Is an option missing, or is the audio being passed incorrectly? This needs further investigation.
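One thing that may be worth checking, stated here as an assumption rather than a confirmed fix: the start message can carry recognition options in addition to `content-type`, and the end of the audio is signalled with a separate stop message. The sketch below builds such messages with `json.dumps`. The option names `continuous` and `interim_results` are recalled from the Speech to Text documentation of that period and should be verified against the API reference before relying on them; `build_start_message` and `build_stop_message` are hypothetical helpers.

```python
import json

def build_start_message(content_type, **options):
    """Return the JSON text of a start message with optional settings."""
    message = {"action": "start", "content-type": content_type}
    message.update(options)
    return json.dumps(message, sort_keys=True)

def build_stop_message():
    """Return the JSON text that marks the end of the audio stream."""
    return json.dumps({"action": "stop"})

# Option names below are assumptions; check them against the API reference.
start = build_start_message(
    "audio/l16;rate=16000",
    continuous=True,
    interim_results=True,
)
print(start)
print(build_stop_message())
```

Building the message from a dictionary instead of a hand-written string makes it easier to try different options while investigating.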


The Bluemix UI keeps changing, the Speech to Text URL differs from the one in the sample, and the service still seems to be under development. The drawback is that investigating takes time.
