TV Script Generation¶

In this project, you'll generate your own Seinfeld TV scripts using RNNs. You'll be using part of the Seinfeld dataset of scripts from 9 seasons. The neural network you'll build will generate a new, "fake" TV script based on patterns it recognizes in this training data.

Get the Data¶

The data is already provided for you in ./data/Seinfeld_Scripts.txt and you're encouraged to open that file and look at the text.

As a first step, we'll load in this data and look at some samples.

Then, you'll be tasked with defining and training an RNN to generate a new script!

"""
DON'T MODIFY ANYTHING IN THIS CELL
"""
# load in data
import helper
data_dir = './data/Seinfeld_Scripts.txt'
text = helper.load_data(data_dir)

Explore the Data¶

Play around with view_line_range to view different parts of the data. This will give you a sense of the data you'll be working with. You can see, for example, that it is all lowercase text, and each new line of dialogue is separated by a newline character \n.

view_line_range = (0, 10)

"""
DON'T MODIFY ANYTHING IN THIS CELL THAT IS BELOW THIS LINE
"""
import numpy as np

print('Dataset Stats')
print('Roughly the number of unique words: {}'.format(len({word: None for word in text.split()})))

lines = text.split('\n')
print('Number of lines: {}'.format(len(lines)))
word_count_line = [len(line.split()) for line in lines]
print('Average number of words in each line: {}'.format(np.average(word_count_line)))

print()
print('The lines {} to {}:'.format(*view_line_range))
print('\n'.join(text.split('\n')[view_line_range[0]:view_line_range[1]]))

Dataset Stats
Roughly the number of unique words: 46367
Number of lines: 109233
Average number of words in each line: 5.544240293684143

The lines 0 to 10:
jerry: do you know what this is all about? do you know, why were here? to be out, this is out...and out is one of the single most enjoyable experiences of life. people...did you ever hear people talking about we should go out? this is what theyre talking about...this whole thing, were all out now, no one is home. not one person here is home, were all out! there are people trying to find us, they dont know where we are. (on an imaginary phone) did you ring?, i cant find him. where did he go? he didnt tell me where he was going. he must have gone out. you wanna go out you get ready, you pick out the clothes, right? you take the shower, you get all ready, get the cash, get your friends, the car, the spot, the reservation...then youre standing around, what do you do? you go we gotta be getting back. once youre out, you wanna get back! you wanna go to sleep, you wanna get up, you wanna go out again tomorrow, right? where ever you are in life, its my feeling, youve gotta go. 

jerry: (pointing at georges shirt) see, to me, that button is in the worst possible spot. the second button literally makes or breaks the shirt, look at it. its too high! its in no-mans-land. you look like you live with your mother. 

george: are you through? 

jerry: you do of course try on, when you buy? 

george: yes, it was purple, i liked it, i dont actually recall considering the buttons.

Implement Pre-processing Functions¶

The first thing to do to any dataset is pre-processing. Implement the following pre-processing functions below:

Lookup Table
Tokenize Punctuation

Lookup Table¶

To create a word embedding, you first need to transform the words to ids. In this function, create two dictionaries:

Dictionary to go from the words to an id, we'll call vocab_to_int
Dictionary to go from the id to word, we'll call int_to_vocab

Return these dictionaries in the following tuple (vocab_to_int, int_to_vocab)

text = text.split()

import problem_unittests as tests
from collections import Counter

def create_lookup_tables(text):
    """
    Create lookup tables for vocabulary
    :param text: The text of tv scripts split into words
    :return: A tuple of dicts (vocab_to_int, int_to_vocab)
    """
    # TODO: Implement Function
    words = text
    word_counts = Counter(words)
    # sorting the words from most to least frequent in text occurrence
    sorted_vocab = sorted(word_counts, key=word_counts.get, reverse=True)
    # create int_to_vocab dictionaries
    int_to_vocab = {ii: word for ii, word in enumerate(sorted_vocab)}
    vocab_to_int = {word: ii for ii, word in int_to_vocab.items()}

    return (vocab_to_int, int_to_vocab)


"""
DON'T MODIFY ANYTHING IN THIS CELL THAT IS BELOW THIS LINE
"""
tests.test_create_lookup_tables(create_lookup_tables)

Tests Passed
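As a quick sanity check, here is a minimal sketch of the same frequency-sorted lookup on a toy word list. The helper name build_lookup and the toy words are made up for illustration; the logic mirrors create_lookup_tables above, so the most frequent word should receive id 0.

```python
from collections import Counter

def build_lookup(words):
    """Toy re-implementation of the lookup tables (illustrative name)."""
    counts = Counter(words)
    # most frequent word first, so it receives the smallest id
    ordered = sorted(counts, key=counts.get, reverse=True)
    int_to_vocab = dict(enumerate(ordered))
    vocab_to_int = {word: idx for idx, word in int_to_vocab.items()}
    return vocab_to_int, int_to_vocab

toy_words = "hello hello hello jerry jerry george".split()
vocab_to_int, int_to_vocab = build_lookup(toy_words)

# encode the words as ids, then decode them back
encoded = [vocab_to_int[w] for w in toy_words]
decoded = [int_to_vocab[i] for i in encoded]
print(vocab_to_int)
print(encoded)
```

Decoding the ids gives back the original word list, which is the property the generation step relies on later.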

Tokenize Punctuation¶

We'll be splitting the script into a word array using spaces as delimiters. However, punctuation such as periods and exclamation marks can create multiple ids for the same word. For example, "bye" and "bye!" would generate two different word ids.

Implement the function token_lookup to return a dict that will be used to tokenize symbols like "!" into "||Exclamation_Mark||". Create a dictionary for the following symbols where the symbol is the key and value is the token:

Period ( . )
Comma ( , )
Quotation Mark ( " )
Semicolon ( ; )
Exclamation mark ( ! )
Question mark ( ? )
Left Parentheses ( ( )
Right Parentheses ( ) )
Dash ( - )
Return ( \n )

This dictionary will be used to tokenize the symbols and add the delimiter (space) around each one. This separates each symbol into its own word, making it easier for the neural network to predict the next word. Make sure you don't use a value that could be confused with a word; for example, instead of using the value "dash", try using something like "||dash||".

def token_lookup():
    """
    Generate a dict to turn punctuation into a token.
    :return: Tokenized dictionary where the key is the punctuation and the value is the token
    """
    # TODO: Implement Function
    # Replace punctuation with tokens so we can use them in our model
    token_dict = {
        '.': '||PERIOD||',
        ',': '||COMMA||',
        '"': '||QUOTATION_MARK||',
        ';': '||SEMICOLON||',
        '!': '||EXCLAMATION_MARK||',
        '?': '||QUESTION_MARK||',
        '(': '||LEFT_PAREN||',
        ')': '||RIGHT_PAREN||',
        '-': '||DASH||',
        '\n': '||NEW_LINE||'
    }
    
    return token_dict
        
"""
DON'T MODIFY ANYTHING IN THIS CELL THAT IS BELOW THIS LINE
"""
tests.test_tokenize(token_lookup)

Tests Passed
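To see why the tokens help, here is a small sketch of how such a dictionary might be applied before splitting on spaces. The tokenize function is hypothetical and only assumes that each symbol is replaced by its token surrounded by spaces; the actual helper code may differ in detail. Only a subset of the symbols is used to keep the example short.

```python
# subset of the punctuation dictionary, for illustration only
token_dict = {
    '.': '||PERIOD||',
    '!': '||EXCLAMATION_MARK||',
    '\n': '||NEW_LINE||',
}

def tokenize(text, token_dict):
    """Hypothetical helper: pad each symbol's token with spaces, then split."""
    text = text.lower()
    for symbol, token in token_dict.items():
        text = text.replace(symbol, ' {} '.format(token))
    return text.split()

tokens = tokenize("Bye! bye.\n", token_dict)
print(tokens)
```

Both occurrences of "bye" now map to the same word, and the punctuation marks become separate words that the network can learn to predict.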

Pre-process all the data and save it¶

Running the code cell below will pre-process all the data and save it to file. You're encouraged to look at the code for preprocess_and_save_data in the helper.py file to see what it's doing in detail, but you do not need to change this code.

"""
DON'T MODIFY ANYTHING IN THIS CELL
"""
# pre-process training data
helper.preprocess_and_save_data(data_dir, token_lookup, create_lookup_tables)

Check Point¶

This is your first checkpoint. If you ever decide to come back to this notebook or have to restart the notebook, you can start from here. The preprocessed data has been saved to disk.

"""
DON'T MODIFY ANYTHING IN THIS CELL
"""
import helper
import problem_unittests as tests

int_text, vocab_to_int, int_to_vocab, token_dict = helper.load_preprocess()

Build the Neural Network¶

In this section, you'll build the components necessary to build an RNN by implementing the RNN Module and forward and backpropagation functions.

Check Access to GPU¶

"""
DON'T MODIFY ANYTHING IN THIS CELL
"""
import torch

# Check for a GPU
train_on_gpu = torch.cuda.is_available()
if not train_on_gpu:
    print('No GPU found. Please use a GPU to train your neural network.')

Input¶

Let's start with the preprocessed input data. We'll use TensorDataset to provide a known format to our dataset; in combination with DataLoader, it will handle batching, shuffling, and other dataset iteration functions.

You can create data with TensorDataset by passing in feature and target tensors. Then create a DataLoader as usual.

data = TensorDataset(feature_tensors, target_tensors)
data_loader = torch.utils.data.DataLoader(data, 
                                          batch_size=batch_size)

Batching¶

Implement the batch_data function to batch words data into chunks of size batch_size using the TensorDataset and DataLoader classes.

You can batch words using the DataLoader, but it will be up to you to create feature_tensors and target_tensors of the correct size and content for a given sequence_length.

For example, say we have these as input:

words = [1, 2, 3, 4, 5, 6, 7]
sequence_length = 4

Your first feature_tensor should contain the values:

[1, 2, 3, 4]

And the corresponding target_tensor should just be the next "word"/tokenized word value:

5

This should continue with the second feature_tensor, target_tensor being:

[2, 3, 4, 5]  # features
6             # target
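The sliding window described above can be sketched with plain Python lists before any tensors are involved. This is only an illustration of the indexing, using the same words and sequence_length as the example.

```python
words = [1, 2, 3, 4, 5, 6, 7]
sequence_length = 4

features = []
targets = []
# stop early enough that every window still has a "next word" after it
for i in range(len(words) - sequence_length):
    features.append(words[i:i + sequence_length])
    targets.append(words[i + sequence_length])

for feature, target in zip(features, targets):
    print(feature, '->', target)
```

With 7 words and a window of 4, there are 7 - 4 = 3 feature/target pairs.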

from torch.utils.data import TensorDataset, DataLoader
import numpy as np  # re-imported so this cell also works when restarting from the checkpoint
import random


def batch_data(words, sequence_length, batch_size):
    """
    Batch the neural network data using DataLoader
    :param words: The word ids of the TV scripts
    :param sequence_length: The sequence length of each batch
    :param batch_size: The size of each batch; the number of sequences in a batch
    :return: DataLoader with batched data
    """
    # TODO: Implement function
    features = []
    targets = []

    for i in range(len(words) - sequence_length):
        features.append(words[i:i+sequence_length])
        targets.append(words[i+sequence_length])
        
    train_features = np.array(features)
    train_targets = np.array(targets)
        
    train_data = TensorDataset(torch.from_numpy(train_features), torch.from_numpy(train_targets))
    train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size)

    # return a dataloader
    return train_loader

# there is no test for this function, but you are encouraged to create
# print statements and tests of your own

Test your dataloader¶

You'll have to modify this code to test a batching function, but it should look fairly similar.

Below, we're generating some test text data and defining a dataloader using the function you defined, above. Then, we are getting some sample batch of inputs sample_x and targets sample_y from our dataloader.

Your code should return something like the following (likely in a different order, if you shuffled your data):

torch.Size([10, 5])
tensor([[ 28,  29,  30,  31,  32],
        [ 21,  22,  23,  24,  25],
        [ 17,  18,  19,  20,  21],
        [ 34,  35,  36,  37,  38],
        [ 11,  12,  13,  14,  15],
        [ 23,  24,  25,  26,  27],
        [  6,   7,   8,   9,  10],
        [ 38,  39,  40,  41,  42],
        [ 25,  26,  27,  28,  29],
        [  7,   8,   9,  10,  11]])

torch.Size([10])
tensor([ 33,  26,  22,  39,  16,  28,  11,  43,  30,  12])

Sizes¶

Your sample_x should be of size (batch_size, sequence_length) or (10, 5) in this case and sample_y should just have one dimension: batch_size (10).

Values¶

You should also notice that the targets, sample_y, are the next value in the ordered test_text data. So, for an input sequence [ 28, 29, 30, 31, 32] that ends with the value 32, the corresponding output should be 33.

# test dataloader

test_text = range(50)
t_loader = batch_data(test_text, sequence_length=5, batch_size=10)

data_iter = iter(t_loader)
sample_x, sample_y = next(data_iter)

print(sample_x.shape)
print(sample_x)
print()
print(sample_y.shape)
print(sample_y)

torch.Size([10, 5])
tensor([[ 43,  44,  45,  46,  47],
        [ 30,  31,  32,  33,  34],
        [ 19,  20,  21,  22,  23],
        [ 20,  21,  22,  23,  24],
        [ 44,  45,  46,  47,  48],
        [ 36,  37,  38,  39,  40],
        [ 31,  32,  33,  34,  35],
        [ 27,  28,  29,  30,  31],
        [  0,   1,   2,   3,   4],
        [ 32,  33,  34,  35,  36]])

torch.Size([10])
tensor([ 48,  35,  24,  25,  49,  41,  36,  32,   5,  37])
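The Sizes and Values checks described above can also be written as assertions instead of being checked by eye. The sketch below rebuilds a small loader on its own so that it is self-contained; it assumes the same test data (the integers 0 to 49) and is not part of the project's test suite.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

words = list(range(50))
sequence_length, batch_size = 5, 10

n_samples = len(words) - sequence_length
features = torch.tensor([words[i:i + sequence_length] for i in range(n_samples)])
targets = torch.tensor([words[i + sequence_length] for i in range(n_samples)])

loader = DataLoader(TensorDataset(features, targets),
                    shuffle=True, batch_size=batch_size)
sample_x, sample_y = next(iter(loader))

# Sizes: (batch_size, sequence_length) and (batch_size,)
assert tuple(sample_x.shape) == (batch_size, sequence_length)
assert tuple(sample_y.shape) == (batch_size,)

# Values: for consecutive integers, each target is the last input value plus one
assert torch.equal(sample_y, sample_x[:, -1] + 1)
print('dataloader checks passed')
```

Because the test data is a run of consecutive integers, the Values check holds for any shuffled batch.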

Build the Neural Network¶

Implement an RNN using PyTorch's Module class. You may choose to use a GRU or an LSTM. To complete the RNN, you'll have to implement the following functions for the class:

__init__ - The initialize function.
init_hidden - The initialization function for an LSTM/GRU hidden state
forward - Forward propagation function.

The initialize function should create the layers of the neural network and save them to the class. The forward propagation function will use these layers to run forward propagation and generate an output and a hidden state.

The output of this model should be the last batch of word scores after a complete sequence has been processed. That is, for each input sequence of words, we only want to output the word scores for a single, most likely, next word.

Hints¶

Make sure to stack the outputs of the lstm before passing them to your fully-connected layer. You can do this with lstm_output = lstm_output.contiguous().view(-1, self.hidden_dim)
You can get the last batch of word scores by shaping the output of the final, fully-connected layer like so:

# reshape into (batch_size, seq_length, output_size)
output = output.view(batch_size, -1, self.output_size)
# get last batch
out = output[:, -1]
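The two reshapes in the hints are easier to follow with concrete shapes. The sketch below uses a random tensor in place of a real LSTM output and small, arbitrary dimensions; all sizes are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

batch_size, seq_length, hidden_dim, output_size = 3, 4, 5, 7

# stand-in for the LSTM output with batch_first=True
lstm_output = torch.randn(batch_size, seq_length, hidden_dim)
fc = nn.Linear(hidden_dim, output_size)

# stack all time steps of all sequences into rows
stacked = lstm_output.contiguous().view(-1, hidden_dim)
print(stacked.shape)   # (batch_size * seq_length, hidden_dim)

scores = fc(stacked)
print(scores.shape)    # (batch_size * seq_length, output_size)

# restore the batch dimension, then keep only the last time step
scores = scores.view(batch_size, -1, output_size)
out = scores[:, -1]
print(out.shape)       # (batch_size, output_size)

# same result as applying the layer to the last time step directly
direct = fc(lstm_output[:, -1])
print(torch.allclose(out, direct, atol=1e-5))
```

The final comparison shows that "last batch of word scores" means one row of scores per input sequence, taken at the final position of that sequence.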

import torch.nn as nn

class RNN(nn.Module):
    
    def __init__(self, vocab_size, output_size, embedding_dim, hidden_dim, n_layers, dropout=0.5):
        """
        Initialize the PyTorch RNN Module
        :param vocab_size: The number of input dimensions of the neural network (the size of the vocabulary)
        :param output_size: The number of output dimensions of the neural network
        :param embedding_dim: The size of embeddings, should you choose to use them        
        :param hidden_dim: The size of the hidden layer outputs
        :param n_layers: The number of stacked LSTM/GRU layers
        :param dropout: dropout to add in between LSTM/GRU layers
        """
        super(RNN, self).__init__()
        # TODO: Implement function
               
        # set class variables
        self.output_size = output_size
        self.n_layers = n_layers
        self.hidden_dim = hidden_dim
        
        # define model layers
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers, 
                            dropout=dropout, batch_first=True)
        
        # fully-connected output layer
        self.fc = nn.Linear(hidden_dim, output_size)    
    
    def forward(self, nn_input, hidden):
        """
        Forward propagation of the neural network
        :param nn_input: The input to the neural network
        :param hidden: The hidden state        
        :return: Two Tensors, the output of the neural network and the latest hidden state
        """
        # TODO: Implement function   
        batch_size = nn_input.size(0)

        # embeddings and lstm_out
        embeds = self.embedding(nn_input)
        lstm_out, hidden = self.lstm(embeds, hidden)
    
        # stack up lstm outputs
        lstm_out = lstm_out.contiguous().view(-1, self.hidden_dim)
        
        # fully-connected layer
        output = self.fc(lstm_out)
        
        # reshape into (batch_size, seq_length, output_size)
        output = output.view(batch_size, -1, self.output_size)
        # get last batch
        out = output[:, -1]

        # return one batch of output word scores and the hidden state
        return out, hidden

    
    def init_hidden(self, batch_size):
        '''
        Initialize the hidden state of an LSTM/GRU
        :param batch_size: The batch_size of the hidden state
        :return: hidden state of dims (n_layers, batch_size, hidden_dim)
        '''
        # Implement function
        
        # initialize hidden state with zero weights, and move to GPU if available
        weight = next(self.parameters()).data
        
        if (train_on_gpu):
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda(),
                  weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda())
        else:
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_(),
                      weight.new(self.n_layers, batch_size, self.hidden_dim).zero_())
        
        return hidden
        
"""
DON'T MODIFY ANYTHING IN THIS CELL THAT IS BELOW THIS LINE
"""
tests.test_rnn(RNN, train_on_gpu)

Tests Passed
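The shape returned by init_hidden has to match what nn.LSTM expects. As a rough illustration with small, arbitrary sizes (all of them assumptions), the sketch below passes a zero-initialized (h, c) pair through a two-layer LSTM and prints the resulting shapes.

```python
import torch
import torch.nn as nn

n_layers, batch_size, hidden_dim = 2, 3, 8
embedding_dim, seq_length = 6, 5

lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers,
               dropout=0.5, batch_first=True)

# an LSTM state is a tuple: (hidden state, cell state)
h0 = torch.zeros(n_layers, batch_size, hidden_dim)
c0 = torch.zeros(n_layers, batch_size, hidden_dim)

x = torch.randn(batch_size, seq_length, embedding_dim)
lstm_out, (h_n, c_n) = lstm(x, (h0, c0))

print(lstm_out.shape)  # (batch_size, seq_length, hidden_dim)
print(h_n.shape)       # (n_layers, batch_size, hidden_dim)
print(c_n.shape)       # (n_layers, batch_size, hidden_dim)
```

Note that batch_first=True only affects the input and output tensors; the state tensors keep the layer dimension first. If you choose a GRU instead, the state is a single tensor rather than a tuple.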

Define forward and backpropagation¶

Use the RNN class you implemented to apply forward and back propagation. This function will be called, iteratively, in the training loop as follows:

loss, hidden = forward_back_prop(rnn, optimizer, criterion, inp, target, hidden)

And it should return the average loss over a batch and the hidden state returned by a call to RNN(inp, hidden). Recall that you can get this loss by computing it, as usual, and calling loss.item().

If a GPU is available, you should move your data to that GPU device, here.

def forward_back_prop(rnn, optimizer, criterion, inp, target, hidden):
    """
    Forward and backward propagation on the neural network
    :param rnn: The PyTorch Module that holds the neural network
    :param optimizer: The PyTorch optimizer for the neural network
    :param criterion: The PyTorch loss function
    :param inp: A batch of input to the neural network
    :param target: The target output for the batch of input
    :param hidden: The hidden state carried over from the previous batch
    :return: The loss and the latest hidden state Tensor
    """
    
    # TODO: Implement Function
    
    # move data to GPU, if available
    if train_on_gpu:
        inp, target = inp.cuda(), target.cuda()
        
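    # detach the LSTM (h, c) state from its history so gradients
    # do not flow back through earlier batches
    hidden = tuple([each.data for each in hidden])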
    
    rnn.zero_grad()
    
    output, hidden = rnn(inp, hidden)
    
    # perform backpropagation
    loss = criterion(output, target)
    loss.backward()
    
    # clip gradient then optimize
    nn.utils.clip_grad_norm_(rnn.parameters(), 5)
    
    optimizer.step()
    
    # return the loss over a batch and the hidden state produced by our model
    return loss.item(), hidden

# Note that these tests aren't completely extensive.
# they are here to act as general checks on the expected outputs of your functions
"""
DON'T MODIFY ANYTHING IN THIS CELL THAT IS BELOW THIS LINE
"""
tests.test_forward_back_prop(RNN, forward_back_prop, train_on_gpu)

Tests Passed
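The gradient clipping step in forward_back_prop can be illustrated in isolation. The sketch below uses a tiny linear model and deliberately large inputs (both are assumptions for the demonstration, not part of the project) to produce a large gradient, then rescales it with clip_grad_norm_ using the same max norm of 5.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Linear(4, 2)
x = torch.randn(8, 4) * 100.0   # large inputs to provoke large gradients
loss = model(x).pow(2).sum()
loss.backward()

def total_grad_norm(module):
    """L2 norm over all parameter gradients of the module."""
    squared = sum(p.grad.pow(2).sum() for p in module.parameters())
    return torch.sqrt(squared).item()

before = total_grad_norm(model)
# returns the total norm measured before clipping
returned = float(nn.utils.clip_grad_norm_(model.parameters(), 5))
after = total_grad_norm(model)

print('norm before clipping:', before)
print('norm after clipping: ', after)
```

Clipping rescales all gradients by a common factor, so the update direction is preserved while its size is bounded. This is a common way to guard recurrent networks against exploding gradients.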

Neural Network Training¶

With the structure of the network complete and data ready to be fed in the neural network, it's time to train it.

Train Loop¶

The training loop is implemented for you in the train_rnn function. This function will train the network over all the batches for the number of epochs given. The model's progress will be shown every show_every_n_batches batches. You'll set this parameter along with other parameters in the next section.

"""
DON'T MODIFY ANYTHING IN THIS CELL
"""

def train_rnn(rnn, batch_size, optimizer, criterion, n_epochs, show_every_n_batches=100):
    batch_losses = []
    
    rnn.train()

    print("Training for %d epoch(s)..." % n_epochs)
    for epoch_i in range(1, n_epochs + 1):
        
        # initialize hidden state
        hidden = rnn.init_hidden(batch_size)
        
        for batch_i, (inputs, labels) in enumerate(train_loader, 1):
            
            # make sure you iterate over completely full batches, only
            n_batches = len(train_loader.dataset)//batch_size
            if(batch_i > n_batches):
                break
            
            # forward, back prop
            loss, hidden = forward_back_prop(rnn, optimizer, criterion, inputs, labels, hidden)          
            # record loss
            batch_losses.append(loss)

            # printing loss stats
            if batch_i % show_every_n_batches == 0:
                print('Epoch: {:>4}/{:<4}  Loss: {}\n'.format(
                    epoch_i, n_epochs, np.average(batch_losses)))
                batch_losses = []

    # returns a trained rnn
    return rnn
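The loop above stops after n_batches so that only completely full batches are used; the hidden state created by init_hidden(batch_size) has a fixed batch dimension, so a smaller final batch would not match it. The arithmetic can be sketched as follows, where full_batches is a hypothetical helper and the numbers come from the small dataloader test earlier.

```python
def full_batches(n_words, sequence_length, batch_size):
    """Return (number of full batches, samples left over in a partial batch)."""
    n_samples = n_words - sequence_length   # one sample per sliding window
    return n_samples // batch_size, n_samples % batch_size

n_full, leftover = full_batches(n_words=50, sequence_length=5, batch_size=10)
print(n_full, leftover)
```

For the 50-word test data this gives 45 samples, so 4 full batches are used and the remaining 5 samples are skipped in that epoch. Since the loader shuffles, a different set of samples is skipped each epoch.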

Hyperparameters¶

Set and train the neural network with the following parameters:

Set sequence_length to the length of a sequence.
Set batch_size to the batch size.
Set num_epochs to the number of epochs to train for.
Set learning_rate to the learning rate for an Adam optimizer.
Set vocab_size to the number of unique tokens in our vocabulary.
Set output_size to the desired size of the output.
Set embedding_dim to the embedding dimension; smaller than the vocab_size.
Set hidden_dim to the hidden dimension of your RNN.
Set n_layers to the number of layers/cells in your RNN.
Set show_every_n_batches to the number of batches at which the neural network should print progress.

If the network isn't getting the desired results, tweak these parameters and/or the layers in the RNN class.

# Data params
# Sequence Length
sequence_length = 10  # of words in a sequence
# Batch Size
batch_size = 128

# data loader - do not change
train_loader = batch_data(int_text, sequence_length, batch_size)

len(vocab_to_int)

21388

# Training parameters
# Number of Epochs
num_epochs = 30
# Learning Rate
learning_rate = 0.0001

# Model parameters
# Vocab size
vocab_size = len(vocab_to_int)
# Output size
output_size = vocab_size
# Embedding Dimension
embedding_dim = 512
# Hidden Dimension
hidden_dim = 256
# Number of RNN Layers
n_layers = 2

# Show stats for every n number of batches
show_every_n_batches = 500
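To get a feel for how these choices affect model size, the parameter count can be estimated by hand. The sketch below assumes the embedding, two-layer LSTM, and linear layer architecture defined earlier, and the way PyTorch's nn.LSTM stores its weights (four gates, with separate input and hidden bias vectors per layer).

```python
vocab_size = 21388      # len(vocab_to_int) reported above
embedding_dim = 512
hidden_dim = 256
n_layers = 2

def lstm_layer_params(input_dim, hidden_dim):
    # 4 gates, each with input weights, hidden weights and two bias vectors
    return 4 * (input_dim * hidden_dim + hidden_dim * hidden_dim + 2 * hidden_dim)

embedding_params = vocab_size * embedding_dim
lstm_params = lstm_layer_params(embedding_dim, hidden_dim)
lstm_params += (n_layers - 1) * lstm_layer_params(hidden_dim, hidden_dim)
fc_params = hidden_dim * vocab_size + vocab_size   # weights + bias

total = embedding_params + lstm_params + fc_params
print('embedding:', embedding_params)
print('lstm:     ', lstm_params)
print('fc:       ', fc_params)
print('total:    ', total)
```

Under these assumptions most of the parameters sit in the embedding and output layers, both of which scale with vocab_size, so changing embedding_dim or hidden_dim changes the model size considerably.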

Train¶

In the next cell, you'll train the neural network on the pre-processed data. If you have a hard time getting a good loss, you may consider changing your hyperparameters. In general, you may get better results with larger hidden_dim and n_layers values, but larger models take longer to train.

You should aim for a loss less than 3.5.
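To put the target of 3.5 in context, it helps to compare it with the loss of a model that guesses uniformly at random. This is a rough sketch; it assumes the loss is the mean cross-entropy in natural-log units (as computed by nn.CrossEntropyLoss) and uses the vocabulary size of 21388 reported above.

```python
import math

vocab_size = 21388

# cross-entropy of a uniform guess over the whole vocabulary
uniform_loss = math.log(vocab_size)

# perplexity: the effective number of equally likely next words
target_perplexity = math.exp(3.5)

print('uniform-guess loss: {:.2f}'.format(uniform_loss))
print('perplexity at loss 3.5: {:.1f}'.format(target_perplexity))
```

A uniform guess gives a loss of roughly 10, so a loss of 3.5 means the model has narrowed its next-word prediction from the full vocabulary to the equivalent of a few dozen equally likely candidates.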

You should also experiment with different sequence lengths, which determine the size of the long range dependencies that a model can learn.

from workspace_utils import active_session

"""
DON'T MODIFY ANYTHING IN THIS CELL
"""

# create model and move to gpu if available
rnn = RNN(vocab_size, output_size, embedding_dim, hidden_dim, n_layers, dropout=0.5)
if train_on_gpu:
    rnn.cuda()

# defining loss and optimization functions for training
optimizer = torch.optim.Adam(rnn.parameters(), lr=learning_rate)
criterion = nn.CrossEntropyLoss()

# training the model
with active_session(): 
    trained_rnn = train_rnn(rnn, batch_size, optimizer, criterion, num_epochs, show_every_n_batches)

# saving the trained model
helper.save_model('./save/trained_rnn', trained_rnn)
print('Model Trained and Saved')

Training for 30 epoch(s)...
Epoch:    1/30    Loss: 6.273230766296387

Epoch:    1/30    Loss: 5.385910182952881

Epoch:    1/30    Loss: 5.202624302864074

Epoch:    1/30    Loss: 5.0440158929824825

Epoch:    1/30    Loss: 4.927725498199463

Epoch:    1/30    Loss: 4.855563611030578

Epoch:    1/30    Loss: 4.781917025566101

Epoch:    1/30    Loss: 4.7339138035774235

Epoch:    1/30    Loss: 4.660841315746308

Epoch:    1/30    Loss: 4.625415425777435

Epoch:    1/30    Loss: 4.581263961315155

Epoch:    1/30    Loss: 4.526801513671875

Epoch:    1/30    Loss: 4.525625303268432

Epoch:    2/30    Loss: 4.445536102672849

Epoch:    2/30    Loss: 4.377646027565002

Epoch:    2/30    Loss: 4.341114012718201

Epoch:    2/30    Loss: 4.321851584434509

Epoch:    2/30    Loss: 4.314976773262024

Epoch:    2/30    Loss: 4.3123681807518

Epoch:    2/30    Loss: 4.335209743022919

Epoch:    2/30    Loss: 4.27544442653656

Epoch:    2/30    Loss: 4.257434257984161

Epoch:    2/30    Loss: 4.249781753540039

Epoch:    2/30    Loss: 4.251488149166107

Epoch:    2/30    Loss: 4.240350749015808

Epoch:    2/30    Loss: 4.232414435863495

Epoch:    3/30    Loss: 4.179602782423651

Epoch:    3/30    Loss: 4.127243815422058

Epoch:    3/30    Loss: 4.151471960544586

Epoch:    3/30    Loss: 4.150628784656525

Epoch:    3/30    Loss: 4.123074727535248

Epoch:    3/30    Loss: 4.10991601896286

Epoch:    3/30    Loss: 4.101238691806794

Epoch:    3/30    Loss: 4.100741771221161

Epoch:    3/30    Loss: 4.088527987957001

Epoch:    3/30    Loss: 4.095835975646972

Epoch:    3/30    Loss: 4.092186082839966

Epoch:    3/30    Loss: 4.0828693327903745

Epoch:    3/30    Loss: 4.078387490749359

Epoch:    4/30    Loss: 4.047111936890058

Epoch:    4/30    Loss: 4.00636932182312

Epoch:    4/30    Loss: 4.008693501472473

Epoch:    4/30    Loss: 3.9917088294029237

Epoch:    4/30    Loss: 3.9978073682785036

Epoch:    4/30    Loss: 3.9925049724578856

Epoch:    4/30    Loss: 3.9988902826309203

Epoch:    4/30    Loss: 3.994539538383484

Epoch:    4/30    Loss: 4.007211709976196

Epoch:    4/30    Loss: 3.984479296207428

Epoch:    4/30    Loss: 3.9818884325027466

Epoch:    4/30    Loss: 3.965790756225586

Epoch:    4/30    Loss: 3.9670524134635925

Epoch:    5/30    Loss: 3.9459750765494395

Epoch:    5/30    Loss: 3.9152674617767333

Epoch:    5/30    Loss: 3.915369257926941

Epoch:    5/30    Loss: 3.906569980621338

Epoch:    5/30    Loss: 3.8909352107048036

Epoch:    5/30    Loss: 3.911321713447571

Epoch:    5/30    Loss: 3.9015031623840333

Epoch:    5/30    Loss: 3.891183704853058

Epoch:    5/30    Loss: 3.89080774307251

Epoch:    5/30    Loss: 3.910298131465912

Epoch:    5/30    Loss: 3.9060023446083068

Epoch:    5/30    Loss: 3.892073621273041

Epoch:    5/30    Loss: 3.8857384452819823

Epoch:    6/30    Loss: 3.8896242947278252

Epoch:    6/30    Loss: 3.822801847934723

Epoch:    6/30    Loss: 3.8521574845314026

Epoch:    6/30    Loss: 3.843459441661835

Epoch:    6/30    Loss: 3.8219378123283385

Epoch:    6/30    Loss: 3.8259994425773622

Epoch:    6/30    Loss: 3.830516234397888

Epoch:    6/30    Loss: 3.844508031368256

Epoch:    6/30    Loss: 3.832982563495636

Epoch:    6/30    Loss: 3.813364098072052

Epoch:    6/30    Loss: 3.8359556946754454

Epoch:    6/30    Loss: 3.8553928809165954

Epoch:    6/30    Loss: 3.8143421626091003

Epoch:    7/30    Loss: 3.7638832838788736

Epoch:    7/30    Loss: 3.7524197325706483

Epoch:    7/30    Loss: 3.774305025577545

Epoch:    7/30    Loss: 3.7676964778900146

Epoch:    7/30    Loss: 3.7786655349731446

Epoch:    7/30    Loss: 3.745363619327545

Epoch:    7/30    Loss: 3.774412061214447

Epoch:    7/30    Loss: 3.765711848258972

Epoch:    7/30    Loss: 3.779640100479126

Epoch:    7/30    Loss: 3.761267595291138

Epoch:    7/30    Loss: 3.7896637225151064

Epoch:    7/30    Loss: 3.7864179797172546

Epoch:    7/30    Loss: 3.7790287289619444

Epoch:    8/30    Loss: 3.740227318646615

Epoch:    8/30    Loss: 3.710821409225464

Epoch:    8/30    Loss: 3.724516562461853

Epoch:    8/30    Loss: 3.7196579561233523

Epoch:    8/30    Loss: 3.712041708946228

Epoch:    8/30    Loss: 3.720858215332031

Epoch:    8/30    Loss: 3.7116080088615417

Epoch:    8/30    Loss: 3.7139327001571654

Epoch:    8/30    Loss: 3.7096731452941896

Epoch:    8/30    Loss: 3.7376085653305053

Epoch:    8/30    Loss: 3.7289879932403562

Epoch:    8/30    Loss: 3.7127652397155764

Epoch:    8/30    Loss: 3.711113497257233

Epoch:    9/30    Loss: 3.679689922815014

Epoch:    9/30    Loss: 3.6552449369430544

Epoch:    9/30    Loss: 3.666497595310211

Epoch:    9/30    Loss: 3.679211824417114

Epoch:    9/30    Loss: 3.6612080006599426

Epoch:    9/30    Loss: 3.6588699955940247

Epoch:    9/30    Loss: 3.683137204170227

Epoch:    9/30    Loss: 3.664845730304718

Epoch:    9/30    Loss: 3.6750151424407957

Epoch:    9/30    Loss: 3.6790340638160703

Epoch:    9/30    Loss: 3.668908239364624

Epoch:    9/30    Loss: 3.6667343530654906

Epoch:    9/30    Loss: 3.665269973754883

Epoch:   10/30    Loss: 3.65446248881219

Epoch:   10/30    Loss: 3.608315248966217

Epoch:   10/30    Loss: 3.622193141937256

Epoch:   10/30    Loss: 3.63651628780365

Epoch:   10/30    Loss: 3.6073988342285155

Epoch:   10/30    Loss: 3.6236666316986086

Epoch:   10/30    Loss: 3.6292623534202577

Epoch:   10/30    Loss: 3.625877299785614

Epoch:   10/30    Loss: 3.622882888317108

Epoch:   10/30    Loss: 3.624065276622772

Epoch:   10/30    Loss: 3.6579544253349305

Epoch:   10/30    Loss: 3.634135447025299

Epoch:   10/30    Loss: 3.6028599352836608

Epoch:   11/30    Loss: 3.599381180370555

Epoch:   11/30    Loss: 3.586912259101868

Epoch:   11/30    Loss: 3.555078206062317

Epoch:   11/30    Loss: 3.59083069229126

Epoch:   11/30    Loss: 3.5866366357803345

Epoch:   11/30    Loss: 3.584257958889008

Epoch:   11/30    Loss: 3.586300304889679

Epoch:   11/30    Loss: 3.5779447264671327

Epoch:   11/30    Loss: 3.5902553634643555

Epoch:   11/30    Loss: 3.5796731753349302

Epoch:   11/30    Loss: 3.5868065304756165

Epoch:   11/30    Loss: 3.5975364074707032

Epoch:   11/30    Loss: 3.605495466709137

Epoch:   12/30    Loss: 3.571661402689537

Epoch:   12/30    Loss: 3.5376559138298034

Epoch:   12/30    Loss: 3.526089183330536

Epoch:   12/30    Loss: 3.5356856265068055

Epoch:   12/30    Loss: 3.5571545877456665

Epoch:   12/30    Loss: 3.553983328342438

Epoch:   12/30    Loss: 3.5350260505676268

Epoch:   12/30    Loss: 3.5470237636566164

Epoch:   12/30    Loss: 3.5411921792030334

Epoch:   12/30    Loss: 3.557991603851318

Epoch:   12/30    Loss: 3.5522463312149046

Epoch:   12/30    Loss: 3.567534511566162

Epoch:   12/30    Loss: 3.56669424200058

Epoch:   13/30    Loss: 3.5341150007749857

Epoch:   13/30    Loss: 3.505442319393158

Epoch:   13/30    Loss: 3.4835101428031923

Epoch:   13/30    Loss: 3.4993831820487977

Epoch:   13/30    Loss: 3.4910373277664184

Epoch:   13/30    Loss: 3.5093510184288026

Epoch:   13/30    Loss: 3.514905460357666

Epoch:   13/30    Loss: 3.5092920665740968

Epoch:   13/30    Loss: 3.530417407989502

Epoch:   13/30    Loss: 3.523942120552063

Epoch:   13/30    Loss: 3.5259550633430483

Epoch:   13/30    Loss: 3.517988458156586

Epoch:   13/30    Loss: 3.5502661328315734

Epoch:   14/30    Loss: 3.488911504469912

Epoch:   14/30    Loss: 3.4561098537445067

Epoch:   14/30    Loss: 3.4686292657852174

Epoch:   14/30    Loss: 3.4645330634117126

Epoch:   14/30    Loss: 3.4843003101348877

Epoch:   14/30    Loss: 3.4982782521247864

Epoch:   14/30    Loss: 3.4725429253578186

Epoch:   14/30    Loss: 3.4794815487861634

Epoch:   14/30    Loss: 3.4917515254020692

Epoch:   14/30    Loss: 3.4804929418563844

Epoch:   14/30    Loss: 3.5041839838027955

Epoch:   14/30    Loss: 3.4928327631950378

Epoch:   14/30    Loss: 3.511185161113739

Epoch:   15/30    Loss: 3.465827794266928

Epoch:   15/30    Loss: 3.4238883209228517

Epoch:   15/30    Loss: 3.436405306816101

Epoch:   15/30    Loss: 3.4303136053085326

Epoch:   15/30    Loss: 3.4458507742881777

Epoch:   15/30    Loss: 3.4416822443008424

Epoch:   15/30    Loss: 3.4433278846740722

Epoch:   15/30    Loss: 3.4612172255516054

Epoch:   15/30    Loss: 3.441174870967865

Epoch:   15/30    Loss: 3.4735500235557555

Epoch:   15/30    Loss: 3.467753975868225

Epoch:   15/30    Loss: 3.4657823133468626

Epoch:   15/30    Loss: 3.490312624454498

Epoch:   16/30    Loss: 3.4234082939947106

Epoch:   16/30    Loss: 3.4085173468589782

Epoch:   16/30    Loss: 3.407755139350891

Epoch:   16/30    Loss: 3.4224259543418882

Epoch:   16/30    Loss: 3.3903081092834473

Epoch:   16/30    Loss: 3.425003534793854

Epoch:   16/30    Loss: 3.4094739599227903

Epoch:   16/30    Loss: 3.440275300502777

Epoch:   16/30    Loss: 3.4150911793708802

Epoch:   16/30    Loss: 3.4428480072021483

Epoch:   16/30    Loss: 3.4527617831230164

Epoch:   16/30    Loss: 3.4262153072357178

Epoch:   16/30    Loss: 3.4297962012290957

Epoch:   17/30    Loss: 3.4043313854126986

Epoch:   17/30    Loss: 3.3781729731559755

Epoch:   17/30    Loss: 3.36411678314209

Epoch:   17/30    Loss: 3.3751589879989625

Epoch:   17/30    Loss: 3.39406889295578

Epoch:   17/30    Loss: 3.3868355684280393

Epoch:   17/30    Loss: 3.3910684247016905

Epoch:   17/30    Loss: 3.391808948993683

Epoch:   17/30    Loss: 3.401913251876831

Epoch:   17/30    Loss: 3.399398877620697

Epoch:   17/30    Loss: 3.393477698802948

Epoch:   17/30    Loss: 3.4074910926818847

Epoch:   17/30    Loss: 3.4133962898254393

Epoch:   18/30    Loss: 3.375820470422168

Epoch:   18/30    Loss: 3.3174234352111815

Epoch:   18/30    Loss: 3.3419127383232117

Epoch:   18/30    Loss: 3.366801207065582

Epoch:   18/30    Loss: 3.36437966299057

Epoch:   18/30    Loss: 3.3557258610725405

Epoch:   18/30    Loss: 3.3678346791267395

Epoch:   18/30    Loss: 3.3658636503219603

Epoch:   18/30    Loss: 3.379771011829376

Epoch:   18/30    Loss: 3.3756116738319397

Epoch:   18/30    Loss: 3.3736482133865358

Epoch:   18/30    Loss: 3.3934017901420592

Epoch:   18/30    Loss: 3.384885505199432

Epoch:   19/30    Loss: 3.360419657208233

Epoch:   19/30    Loss: 3.314458378314972

Epoch:   19/30    Loss: 3.3180603528022767

Epoch:   19/30    Loss: 3.331433108329773

Epoch:   19/30    Loss: 3.33203145980835

Epoch:   19/30    Loss: 3.3332944502830504

Epoch:   19/30    Loss: 3.3376279978752135

Epoch:   19/30    Loss: 3.3327050166130068

Epoch:   19/30    Loss: 3.35538733291626

Epoch:   19/30    Loss: 3.363545600891113

Epoch:   19/30    Loss: 3.3526614603996276

Epoch:   19/30    Loss: 3.3626621589660646

Epoch:   19/30    Loss: 3.362010947704315

Epoch:   20/30    Loss: 3.3221282895012414

Epoch:   20/30    Loss: 3.286849771022797

Epoch:   20/30    Loss: 3.3081248359680178

Epoch:   20/30    Loss: 3.294210066318512

Epoch:   20/30    Loss: 3.315214951992035

Epoch:   20/30    Loss: 3.3128459343910217

Epoch:   20/30    Loss: 3.326892385959625

Epoch:   20/30    Loss: 3.3229364070892333

Epoch:   20/30    Loss: 3.316112322807312

Epoch:   20/30    Loss: 3.3214972157478333

Epoch:   20/30    Loss: 3.3298243083953856

Epoch:   20/30    Loss: 3.3273267402648927

Epoch:   20/30    Loss: 3.349983717441559

Epoch:   21/30    Loss: 3.310029291393095

Epoch:   21/30    Loss: 3.2650166296958925

Epoch:   21/30    Loss: 3.2693379836082457

Epoch:   21/30    Loss: 3.29395987033844

Epoch:   21/30    Loss: 3.2721795659065247

Epoch:   21/30    Loss: 3.2944993782043457

Epoch:   21/30    Loss: 3.2955098037719726

Epoch:   21/30    Loss: 3.28754759311676

Epoch:   21/30    Loss: 3.3000942392349244

Epoch:   21/30    Loss: 3.3186752395629884

Epoch:   21/30    Loss: 3.320007354259491

Epoch:   21/30    Loss: 3.306859487056732

Epoch:   21/30    Loss: 3.323731963634491

Epoch:   22/30    Loss: 3.2841517693979205

Epoch:   22/30    Loss: 3.2496934099197388

Epoch:   22/30    Loss: 3.2453078284263612

Epoch:   22/30    Loss: 3.2483799166679383

Epoch:   22/30    Loss: 3.2531050543785094

Epoch:   22/30    Loss: 3.2789304752349855

Epoch:   22/30    Loss: 3.265663366317749

Epoch:   22/30    Loss: 3.3005400767326356

Epoch:   22/30    Loss: 3.285008453369141

Epoch:   22/30    Loss: 3.2904725017547607

Epoch:   22/30    Loss: 3.3054687814712524

Epoch:   22/30    Loss: 3.296708556175232

Epoch:   22/30    Loss: 3.2858700346946716

Epoch:   23/30    Loss: 3.2633569708057477

Epoch:   23/30    Loss: 3.2211959600448608

Epoch:   23/30    Loss: 3.2450813088417054

Epoch:   23/30    Loss: 3.245305696487427

Epoch:   23/30    Loss: 3.2445266346931456

Epoch:   23/30    Loss: 3.2501451878547667

Epoch:   23/30    Loss: 3.2525939531326293

Epoch:   23/30    Loss: 3.2666564440727233

Epoch:   23/30    Loss: 3.251141736984253

Epoch:   23/30    Loss: 3.266198896884918

Epoch:   23/30    Loss: 3.276807692527771

Epoch:   23/30    Loss: 3.266306797981262

Epoch:   23/30    Loss: 3.261375202178955

Epoch:   24/30    Loss: 3.2385331996331153

Epoch:   24/30    Loss: 3.218121202945709

Epoch:   24/30    Loss: 3.2175164318084715

Epoch:   24/30    Loss: 3.2354626812934875

Epoch:   24/30    Loss: 3.2168924946784974

Epoch:   24/30    Loss: 3.233768678188324

Epoch:   24/30    Loss: 3.229474025726318

Epoch:   24/30    Loss: 3.2387753195762636

Epoch:   24/30    Loss: 3.227874460697174

Epoch:   24/30    Loss: 3.2356677680015564

Epoch:   24/30    Loss: 3.263615536689758

Epoch:   24/30    Loss: 3.2379907126426697

Epoch:   24/30    Loss: 3.2586484317779543

Epoch:   25/30    Loss: 3.2225245745435465

Epoch:   25/30    Loss: 3.188791670322418

Epoch:   25/30    Loss: 3.1779782032966613

Epoch:   25/30    Loss: 3.1956742238998412

Epoch:   25/30    Loss: 3.2017483239173887

Epoch:   25/30    Loss: 3.2124328336715697

Epoch:   25/30    Loss: 3.1928254714012145

Epoch:   25/30    Loss: 3.2217950801849367

Epoch:   25/30    Loss: 3.21467392206192

Epoch:   25/30    Loss: 3.2408018479347227

Epoch:   25/30    Loss: 3.21601420211792

Epoch:   25/30    Loss: 3.2399124054908754

Epoch:   25/30    Loss: 3.236413809776306

Epoch:   26/30    Loss: 3.217228324543704

Epoch:   26/30    Loss: 3.1674133138656617

Epoch:   26/30    Loss: 3.163740966320038

Epoch:   26/30    Loss: 3.18978165102005

Epoch:   26/30    Loss: 3.1800457735061647

Epoch:   26/30    Loss: 3.2108518261909484

Epoch:   26/30    Loss: 3.19913801240921

Epoch:   26/30    Loss: 3.206608341217041

Epoch:   26/30    Loss: 3.1733978691101075

Epoch:   26/30    Loss: 3.219577139377594

Epoch:   26/30    Loss: 3.2183975372314455

Epoch:   26/30    Loss: 3.225554168701172

Epoch:   26/30    Loss: 3.211917796611786

Epoch:   27/30    Loss: 3.190560873078857

Epoch:   27/30    Loss: 3.131858295917511

Epoch:   27/30    Loss: 3.1682797112464907

Epoch:   27/30    Loss: 3.152910353183746

Epoch:   27/30    Loss: 3.181439105510712

Epoch:   27/30    Loss: 3.1770586042404174

Epoch:   27/30    Loss: 3.1827981510162355

Epoch:   27/30    Loss: 3.1851818594932557

Epoch:   27/30    Loss: 3.183151349067688

Epoch:   27/30    Loss: 3.1878969688415526

Epoch:   27/30    Loss: 3.2114626083374023

Epoch:   27/30    Loss: 3.1917788491249084

Epoch:   27/30    Loss: 3.1924280648231504

Epoch:   28/30    Loss: 3.1671567105902483

Epoch:   28/30    Loss: 3.133828806877136

Epoch:   28/30    Loss: 3.1295106744766237

Epoch:   28/30    Loss: 3.1568722648620606

Epoch:   28/30    Loss: 3.1433221015930175

Epoch:   28/30    Loss: 3.133398154258728

Epoch:   28/30    Loss: 3.1717132215499877

Epoch:   28/30    Loss: 3.1592939615249636

Epoch:   28/30    Loss: 3.168725237369537

Epoch:   28/30    Loss: 3.1594085049629212

Epoch:   28/30    Loss: 3.187184054851532

Epoch:   28/30    Loss: 3.1936865868568423

Epoch:   28/30    Loss: 3.1885165133476256

Epoch:   29/30    Loss: 3.153720211564449

Epoch:   29/30    Loss: 3.1224088459014894

Epoch:   29/30    Loss: 3.1378439779281617

Epoch:   29/30    Loss: 3.1388012008666992

Epoch:   29/30    Loss: 3.1390110387802124

Epoch:   29/30    Loss: 3.132974019527435

Epoch:   29/30    Loss: 3.132197156429291

Epoch:   29/30    Loss: 3.1449569773674013

Epoch:   29/30    Loss: 3.152089892864227

Epoch:   29/30    Loss: 3.1529767155647277

Epoch:   29/30    Loss: 3.149431604385376

Epoch:   29/30    Loss: 3.1611909823417665

Epoch:   29/30    Loss: 3.1555473108291627

Epoch:   30/30    Loss: 3.145715437437359

Epoch:   30/30    Loss: 3.1111401567459107

Epoch:   30/30    Loss: 3.1049295558929444

Epoch:   30/30    Loss: 3.1290437316894533

Epoch:   30/30    Loss: 3.1039694933891298

Epoch:   30/30    Loss: 3.1298315215110777

Epoch:   30/30    Loss: 3.1187242093086245

Epoch:   30/30    Loss: 3.1333561611175536

Epoch:   30/30    Loss: 3.1541508026123046

Epoch:   30/30    Loss: 3.127814531326294

Epoch:   30/30    Loss: 3.155260736465454

Epoch:   30/30    Loss: 3.1538154802322387

Epoch:   30/30    Loss: 3.148553409576416

/opt/conda/lib/python3.6/site-packages/torch/serialization.py:193: UserWarning: Couldn't retrieve source code for container of type RNN. It won't be checked for correctness upon loading.
  "type " + obj.__name__ + ". It won't be checked "

Model Trained and Saved

Question: How did you decide on your model hyperparameters?¶

For example, did you try different sequence_lengths and find that one size made the model converge faster? What about your hidden_dim and n_layers; how did you decide on those?

Answer:

I experimented with different sequence lengths, batch sizes, embedding dimensions, hidden dimensions, numbers of LSTM layers, and learning rates.

I found that the model trains well when the sequence length is 10 or under. I first tried large embedding and hidden dimensions (in the thousands) because the input dimension is so large, but I only managed to train the model once the embedding and hidden dimensions were cut down to around 512 and 256, respectively. I started with a learning rate of 0.001, but found that 0.0001 reduces the loss more consistently. I also tried batch sizes of 32, 64, and 128 and had the most success with 128.
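One way to compare hyperparameter settings like these is to average the printed losses per epoch rather than reading individual lines. The sketch below is not part of the project code. It assumes the log format printed by the training cell above, and `mean_loss_per_epoch` is a hypothetical helper name introduced only for illustration.

```python
import re
from collections import defaultdict

# Matches lines such as "Epoch:   29/30    Loss: 3.1611909823417665"
LOG_LINE = re.compile(r"Epoch:\s*(\d+)/(\d+)\s+Loss:\s*([0-9.]+)")

def mean_loss_per_epoch(log_text):
    """Average every printed loss value that belongs to the same epoch."""
    losses = defaultdict(list)
    for match in LOG_LINE.finditer(log_text):
        epoch, _total_epochs, loss = match.groups()
        losses[int(epoch)].append(float(loss))
    return {epoch: sum(vals) / len(vals) for epoch, vals in sorted(losses.items())}

# A few lines copied from the training output above
sample_log = """
Epoch:   29/30    Loss: 3.1611909823417665
Epoch:   29/30    Loss: 3.1555473108291627
Epoch:   30/30    Loss: 3.145715437437359
Epoch:   30/30    Loss: 3.1111401567459107
"""

for epoch, loss in mean_loss_per_epoch(sample_log).items():
    print(f"epoch {epoch}: mean loss {loss:.4f}")
```

Running the same helper on the logs of two runs (for example, one per learning rate) gives one number per epoch for each run, which is easier to compare than the raw output.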

Checkpoint¶

After running the above training cell, your model will be saved under the name trained_rnn. If you save your notebook progress, you can pause here and come back to this code at another time. To resume, run the next cell, which loads the word:id dictionaries and your saved model by name!

"""
DON'T MODIFY ANYTHING IN THIS CELL
"""
import torch
import helper
import problem_unittests as tests

_, vocab_to_int, int_to_vocab, token_dict = helper.load_preprocess()
trained_rnn = helper.load_model('./save/trained_rnn')

Generate TV Script¶

With the network trained and saved, you'll use it to generate a new, "fake" Seinfeld TV script in this section.

Generate Text¶

To generate the text, the network needs to start with a single word and repeat its predictions until it reaches a set length. You'll be using the generate function to do this. It takes a word id to start with, prime_id, and generates text of a set length, predict_len. Also note that it uses top-k sampling to introduce some randomness: given an output set of word scores, the next word is drawn from the k highest-scoring words instead of always taking the single most likely one!
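The top-k sampling step can be looked at in isolation. The following is a minimal NumPy-only sketch, not the notebook's implementation: `sample_top_k` is a hypothetical name, the scores are made up, and the softmax is written by hand so the example runs without PyTorch.

```python
import numpy as np

def sample_top_k(scores, k=5, rng=None):
    """Draw one index from the k highest-scoring entries of `scores`."""
    if rng is None:
        rng = np.random.default_rng()
    scores = np.asarray(scores, dtype=np.float64)

    # Softmax: turn raw scores into probabilities (shifted for numerical stability)
    exp_scores = np.exp(scores - scores.max())
    probs = exp_scores / exp_scores.sum()

    # Keep only the k most likely indices and renormalise their probabilities
    top_idx = np.argsort(probs)[-k:]
    top_p = probs[top_idx]
    return int(rng.choice(top_idx, p=top_p / top_p.sum()))

# Made-up scores for a 10-word vocabulary
scores = [0.1, 2.0, -1.0, 3.5, 0.7, 1.2, -0.3, 2.8, 0.0, 1.9]
rng = np.random.default_rng(0)
print([sample_top_k(scores, k=5, rng=rng) for _ in range(10)])
```

A smaller k makes the output more predictable, while a larger k allows less likely words through.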

"""
DON'T MODIFY ANYTHING IN THIS CELL THAT IS BELOW THIS LINE
"""
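import numpy as np
import torch.nn.functional as F

# note: generate() also relies on the notebook globals sequence_length and
# train_on_gpu, which are defined in earlier cells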

def generate(rnn, prime_id, int_to_vocab, token_dict, pad_value, predict_len=100):
    """
    Generate text using the neural network
    :param rnn: The PyTorch Module that holds the trained neural network
    :param prime_id: The word id to start the first prediction
    :param int_to_vocab: Dict of word id keys to word values
    :param token_dict: Dict of punctuation tokens keys to punctuation values
    :param pad_value: The value used to pad a sequence
    :param predict_len: The length of text to generate
    :return: The generated text
    """
    rnn.eval()
    
    # create a sequence (batch_size=1) with the prime_id
    current_seq = np.full((1, sequence_length), pad_value)
    current_seq[-1][-1] = prime_id
    predicted = [int_to_vocab[prime_id]]
    
    for _ in range(predict_len):
        if train_on_gpu:
            current_seq = torch.LongTensor(current_seq).cuda()
        else:
            current_seq = torch.LongTensor(current_seq)
        
        # initialize the hidden state
        hidden = rnn.init_hidden(current_seq.size(0))
        
        # get the output of the rnn
        output, _ = rnn(current_seq, hidden)
        
        # get the next word probabilities
        p = F.softmax(output, dim=1).data
        if(train_on_gpu):
            p = p.cpu() # move to cpu
         
        # use top_k sampling to get the index of the next word
        top_k = 5
        p, top_i = p.topk(top_k)
        top_i = top_i.numpy().squeeze()
        
        # select the likely next word index with some element of randomness
        p = p.numpy().squeeze()
        word_i = np.random.choice(top_i, p=p/p.sum())
        
        # retrieve that word from the dictionary
        word = int_to_vocab[word_i]
        predicted.append(word)     
        
        # the generated word becomes the next "current sequence" and the cycle can continue
        # (move the tensor back to the CPU as a NumPy array before rolling it)
        current_seq = np.roll(current_seq.cpu().numpy(), -1, 1)
        current_seq[-1][-1] = word_i
    
    gen_sentences = ' '.join(predicted)
    
    # Replace punctuation tokens
    for key, token in token_dict.items():
        gen_sentences = gen_sentences.replace(' ' + token.lower(), key)
    gen_sentences = gen_sentences.replace('\n ', '\n')
    gen_sentences = gen_sentences.replace('( ', '(')
    
    # return all the sentences
    return gen_sentences
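The way generate keeps its input window can be seen without a trained network. This is a small sketch under assumed values: the window length, padding id, starting id, and the "predicted" ids are all made up for illustration, and only the np.full / np.roll steps mirror the function above.

```python
import numpy as np

sequence_length = 5  # assumed window size, for illustration only
pad_value = 0        # assumed id of the padding token
prime_id = 42        # assumed id of the starting word

# Start with a window of padding and put the prime word in the last slot
window = np.full((1, sequence_length), pad_value)
window[-1][-1] = prime_id
print(window)

# Stand-ins for the ids the network would predict at each step
for next_id in (7, 19, 3):
    # Shift everything one slot to the left, then overwrite the last slot
    window = np.roll(window, -1, 1)
    window[-1][-1] = next_id
    print(window)
```

Each step drops the oldest entry and appends the newest word id, so the network always sees the most recent sequence_length ids.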

Generate a New Script¶

It's time to generate the text. Set gen_length to the length of TV script you want to generate and set prime_word to one of the following to start the prediction:

"jerry"
"elaine"
"george"
"kramer"

You can set the prime word to any word in our dictionary, but it's best to start with a name for generating a TV script. (You can also start with any other names you find in the original text file!)
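Note that the generation cell looks up prime_word + ':' in vocab_to_int, so the chosen word has to appear in the data as a speaker label. A guarded lookup could look like the sketch below. `resolve_prime_id` is a hypothetical helper and the tiny vocabulary is made up; the real vocab_to_int comes from helper.load_preprocess().

```python
def resolve_prime_id(prime_word, vocab_to_int):
    """Return the id of the speaker token 'name:' or raise a readable error."""
    key = prime_word.lower() + ':'
    if key not in vocab_to_int:
        raise KeyError(
            "'{}' is not in the vocabulary; pick a name that appears "
            "as a speaker in the script file".format(key)
        )
    return vocab_to_int[key]

# Made-up vocabulary for illustration only
example_vocab = {'jerry:': 10, 'elaine:': 11, 'george:': 12, 'kramer:': 13}

print(resolve_prime_id('jerry', example_vocab))
try:
    resolve_prime_id('frank', example_vocab)
except KeyError as err:
    print('lookup failed:', err)
```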

# run the cell multiple times to get different results!
gen_length = 400 # modify the length to your preference
prime_word = 'jerry' # name for starting the script

"""
DON'T MODIFY ANYTHING IN THIS CELL THAT IS BELOW THIS LINE
"""
pad_word = helper.SPECIAL_WORDS['PADDING']
generated_script = generate(trained_rnn, vocab_to_int[prime_word + ':'], int_to_vocab, token_dict, vocab_to_int[pad_word], gen_length)
print(generated_script)

/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:42: UserWarning: RNN module weights are not part of single contiguous chunk of memory. This means they need to be compacted at every call, possibly greatly increasing memory usage. To compact weights again call flatten_parameters().

jerry: what is that?

jerry: i don't know what the hell is that. i just think that you're a very interesting idea.

kramer: hey! hey, you guys are gonna go to the airport.

jerry: i think i'm a little concerned.

jerry: you don't think i was going to do it?

kramer: oh no, i just wanted you to know how much you can do it?

jerry:(to elaine, he turns in the hallway. kramer enters.)

george:(to the woman) hey, what happened to you?

jerry: i don't know what i'm going to do.

george: i don't know.

elaine: you know, i'm sorry. i just, uh, i was just a little bit of a friend, i'm not gonna get going.

elaine: you know, i think it's a very good thing about the car, jerry. i'm a little man.(jerry enters)

jerry:(to george) i don't think so.

jerry:(still on the phone) oh yeah, yeah. yeah. yeah.

elaine: well, it's just a lot of people work in the house.

jerry:(to george) you know, i don't like the job, you don't think so, what are you doing there? you got it?

jerry: well, it's not a very popular man.

george: i don't understand.

jerry: you don't know how you were gonna go out of the bathroom?

george: well, you were not a comedian.

kramer: i know you don't know.

jerry: you mean i don't know what you said to me, but i can't do this.

elaine: well...

elaine: i think i should.

george: you know, the only thing is about that, i think it's about the worst thing that i was going to be the first one of my life.

george: what?

Save your favorite scripts¶

Once you have a script that you like (or find interesting), save it to a text file!

# save script to a text file
with open("generated_script_1.txt", "w") as f:
    f.write(generated_script)

The TV Script is Not Perfect¶

It's OK if the TV script doesn't make perfect sense. It should look like alternating lines of dialogue; here is one example of a few generated lines.

Example generated script¶

jerry: what about me?

jerry: i don't have to wait.

kramer:(to the sales table)

elaine:(to jerry) hey, look at this, i'm a good doctor.

newman:(to elaine) you think i have no idea of this...

elaine: oh, you better take the phone, and he was a little nervous.

kramer:(to the phone) hey, hey, jerry, i don't want to be a little bit.(to kramer and jerry) you can't.

jerry: oh, yeah. i don't even know, i know.

jerry:(to the phone) oh, i know.

kramer:(laughing) you know...(to jerry) you don't know.

You can see that there are multiple characters that say (somewhat) complete sentences, but it doesn't have to be perfect! It takes quite a while to get good results, and often, you'll have to use a smaller vocabulary (and discard uncommon words), or get more data. The Seinfeld dataset is about 3.4 MB, which is big enough for our purposes; for script generation you'll want more than 1 MB of text, generally.
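Discarding uncommon words can be sketched as a simple frequency filter. This is an illustration only, not something the project's helper.py provides: `trim_vocabulary`, the `min_count` threshold, and the `<unk>` placeholder token are all assumptions.

```python
from collections import Counter

def trim_vocabulary(words, min_count=2, unknown_token='<unk>'):
    """Replace every word seen fewer than `min_count` times with a placeholder."""
    counts = Counter(words)
    return [w if counts[w] >= min_count else unknown_token for w in words]

# Made-up text for illustration
words = "jerry: hello newman jerry: hello".split()
trimmed = trim_vocabulary(words, min_count=2)

print(trimmed)
print('vocabulary size before:', len(set(words)), 'after:', len(set(trimmed)))
```

The trade-off is that the network can never generate the discarded words, so the threshold is worth tuning against how the generated scripts read.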

Submitting This Project¶

When submitting this project, make sure to run all the cells before saving the notebook. Save the notebook file as "dlnd_tv_script_generation.ipynb" and save another copy as an HTML file by clicking "File" -> "Download as.." -> "html". Include the "helper.py" and "problem_unittests.py" files in your submission. Once you have all of these files, compress them into one zip file for submission.