Loading data for a Language Model


In [1]:
%install-location $cwd/swift-install
%install-swiftpm-flags -c release
%install '.package(url: "https://github.com/tensorflow/swift-models", .branch("master"))' ModelSupport Datasets


Installing packages:
	.package(url: "https://github.com/tensorflow/swift-models", .branch("master"))
		ModelSupport
		Datasets
With SwiftPM flags: ['-c', 'release']
Working in: /tmp/tmpcqr55gh1/swift-install
[1/4] Compiling STBImage stb_image_write.c
[2/4] Compiling STBImage stb_image.c
[3/4] Compiling SwiftProtobuf AnyMessageStorage.swift
/content/swift-install/package/.build/checkouts/swift-protobuf/Sources/SwiftProtobuf/BinaryDelimited.swift:198:7: warning: variable 'readBuffer' was never mutated; consider changing to 'let' constant
  var readBuffer = UnsafeMutablePointer<UInt8>.allocate(capacity: 1)
  ~~~ ^
  let
[4/5] Compiling ModelSupport BijectiveDictionary.swift
[5/6] Compiling Datasets BostonHousing.swift
[6/7] Compiling jupyterInstalledPackages jupyterInstalledPackages.swift
[7/7] Linking libjupyterInstalledPackages.so
Initializing Swift...
Installation complete!

In [2]:
import TensorFlow
import Foundation
import ModelSupport
import Datasets

Downloading the data

First things first: we need to download the data somewhere. We use DatasetUtilities for this. Split the URL of the archive into the file name, the extension and the rest of the URL (the remote root), then specify the folder where you want it downloaded. The function .downloadResource will then download the archive (if needed) and extract it, creating the folder you specified if it didn't already exist.
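As a small illustration of that split, here is a sketch that derives the three pieces from a full archive URL using only Foundation. The variable names are hypothetical and not part of DatasetUtilities; the URL is the one used in the next cell.

```swift
import Foundation

// Full URL of the archive (same location as in the download cell).
let archiveURL = URL(string: "https://s3.amazonaws.com/fast-ai-nlp/wikitext-2.tgz")!

// Extension without the leading dot.
let archiveExtension = archiveURL.pathExtension
// File name without its extension.
let archiveName = archiveURL.deletingPathExtension().lastPathComponent
// Everything before the file name.
let archiveRoot = archiveURL.deletingLastPathComponent()

print(archiveName, archiveExtension, archiveRoot.absoluteString)
```

These three values correspond to the filename, fileExtension and remoteRoot arguments passed to .downloadResource.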


In [3]:
let cwdURL = URL(fileURLWithPath: FileManager.default.currentDirectoryPath)
let dataFolder = DatasetUtilities.downloadResource(
    filename: "wikitext-2", 
    fileExtension: "tgz",
    remoteRoot: URL(string: "https://s3.amazonaws.com/fast-ai-nlp/")!,
    localStorageDirectory: cwdURL.appendingPathComponent("data/", isDirectory: true)
)


Loading resource: wikitext-2
File does not exist locally at expected path: /content/data/wikitext-2 and must be fetched
Fetching URL: https://s3.amazonaws.com/fast-ai-nlp/wikitext-2.tgz...
Archive saved to: /content/data

In [4]:
var trainTexts = try! String(contentsOf: dataFolder.appendingPathComponent("train.csv"), encoding: .utf8)

In [5]:
public func readCSV(in file: URL) -> [String] {
    let rawText = try! String(contentsOf: file, encoding: .utf8)
    var rows = rawText.components(separatedBy: "\"\n\"")
    //Removing the initial "
    rows[0] = String(rows[0].dropFirst())
    //Removing the last "\n
    rows[rows.indices.last!] = String(rows.last!.dropLast(2))
    return rows
}
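To see why readCSV splits on the three-character sequence quote, newline, quote rather than on newlines alone, here is a sketch on an in-memory string. It assumes, as the function above does, that each text is wrapped in double quotes and that the file ends with a newline; the sample content is made up.

```swift
import Foundation

// Three quoted texts, one per line; the second contains an embedded newline.
let sampleCSV = "\"first text\"\n\"second\ntext\"\n\"third text\"\n"

// Same steps as readCSV, applied to the string instead of a file.
var sampleRows = sampleCSV.components(separatedBy: "\"\n\"")
sampleRows[0] = String(sampleRows[0].dropFirst())
sampleRows[sampleRows.indices.last!] = String(sampleRows.last!.dropLast(2))

print(sampleRows.count)
```

The newline inside the second text is preserved because it is not surrounded by quotes, so the string still yields three rows.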

In [6]:
let trainTexts = readCSV(in: dataFolder.appendingPathComponent("train.csv"))
let validTexts = readCSV(in: dataFolder.appendingPathComponent("test.csv"))

In [7]:
trainTexts[0]


Out[7]:
" \n = 2013 – 14 York City F.C. season = \n \n The 2013 – 14 season was the <unk> season of competitive association football and 77th season in the Football League played by York City Football Club , a professional football club based in York , North Yorkshire , England . Their 17th @-@ place finish in 2012 – 13 meant it was their second consecutive season in League Two . The season ran from 1 July 2013 to 30 June 2014 . \n Nigel Worthington , starting his first full season as York manager , made eight permanent summer signings . By the turn of the year York were only above the relegation zone on goal difference , before a 17 @-@ match unbeaten run saw the team finish in seventh @-@ place in the 24 @-@ team 2013 – 14 Football League Two . This meant York qualified for the play @-@ offs , and they were eliminated in the semi @-@ final by Fleetwood Town . York were knocked out of the 2013 – 14 FA Cup , Football League Cup and Football League Trophy in their opening round matches . \n 35 players made at least one appearance in nationally organised first @-@ team competition , and there were 12 different <unk> . Defender Ben Davies missed only five of the fifty @-@ two competitive matches played over the season . Wes Fletcher finished as leading scorer with 13 goals , of which 10 came in league competition and three came in the FA Cup . The winner of the <unk> of the Year award , voted for by the club \'s supporters , was <unk> Oyebanjo . \n \n = = Background and pre @-@ season = = \n \n The 2012 – 13 season was York City \'s first season back in the Football League , having won the Conference Premier play @-@ offs in 2011 – 12 after <unk> years in the Football Conference . Manager Gary Mills was sacked in March 2013 following an 11 @-@ match run without a victory , and was replaced by former Northern Ireland manager Nigel Worthington . 
Despite being in the relegation zone with three matches remaining , Worthington led the team to safety from relegation after a 1 – 0 win away to Dagenham & Redbridge on the final day of the season . York finished the season in 17th @-@ place in the 2012 – 13 League Two table . \n Following the previous season \'s conclusion Lee <unk> , Jon <unk> , Chris <unk> , Ben Everson , Scott Kerr , David <unk> , Patrick <unk> , Michael Potts , Jamie Reed and Jason Walker were released by York , while <unk> Blair departed for Fleetwood Town . David McGurk , <unk> Oyebanjo , Danny Parslow , Tom Platt and Chris Smith signed new contracts with the club . New players signed ahead of the start of the season were goalkeeper Chris <unk> on a season @-@ long loan from Blackpool , defender Ben Davies on loan from Preston North End , midfielders Craig Clay from Chesterfield and Lewis Montrose from Gillingham , winger <unk> Puri from St <unk> and strikers Ryan Bowman from Hereford United , Richard Cresswell from Sheffield United , Wes Fletcher from Burnley and Ryan Jarvis from Torquay United . Defender Mike Atkinson and striker Chris Dickinson entered the first @-@ team squad from the youth team after agreeing professional contracts . \n York retained the previous season \'s home and away kits . The home kit comprised red shirts with white sleeves , light blue shorts and white socks . The away kit included light blue shirts with white sleeves , white shorts and light blue socks . <unk> Health continued as shirt sponsors for the second successive season . \n \n = = Review = = \n \n \n = = = August = = = \n \n York began the season with a 1 – 0 home win over the previous season \'s play @-@ off finalists , Northampton Town , with <unk> Jarvis scoring the winning goal in the 90th @-@ minute . However , defeat came in York \'s match against Championship side Burnley in the first round of the League Cup , going down 4 – 0 at home . 
The team endured their first league defeat of the season in the following game after being beaten 2 – 0 away by Dagenham & Redbridge , the home team scoring in each half . York then held Hartlepool United to a 0 – 0 home draw , before being beaten 3 – 2 away by Bristol Rovers , in which Jarvis scored twice before John @-@ Joe O \'Toole scored the winning goal for the home team in the 67th @-@ minute . Two signings were made shortly before the transfer deadline ; defender George Taft was signed on a one @-@ month loan from Leicester City , while Middlesbrough midfielder Ryan Brobbel joined on a one @-@ month loan . <unk> John <unk> , who had been told he had no future with the club , departed after signing for FC Halifax Town . Jarvis gave York the lead away at Exeter City before Alan <unk> scored in each half to see the home team win 2 – 1 . \n \n = = = September = = = \n \n York suffered their first home league defeat of the season after AFC Wimbledon won 2 – 0 , with Michael Smith scoring in each half . Former Ipswich Town midfielder Josh Carson , who had a spell on loan with York the previous season , signed a contract until the end of 2013 – 14 and Sheffield United midfielder Elliott <unk> signed on a one @-@ month loan . Brobbel opened the scoring in the second minute of his home debut against Mansfield Town , although the away team went on to score twice to win 2 – 1 . York \'s run of four defeats ended following a 1 – 1 draw away to Wycombe Wanderers , in which McGurk gave York the lead before the home team levelled through Dean Morgan . Taft was sent back to Leicester after he fell behind McGurk , Parslow and Smith in the pecking order for a central defensive berth . York achieved their first win since the opening day of the season after beating Portsmouth 4 – 2 at home , with Fletcher ( 2 ) , Montrose and Jarvis scoring . \n \n = = = October = = = \n \n Defender Luke O \'Neill was signed from Burnley on a 28 @-@ day emergency loan . 
He made his debut in York \'s 3 – 0 win away at Torquay , which was the team \'s first successive win of the season . York were knocked out of the Football League Trophy in the second round after being beaten 3 – 0 at home by League One team Rotherham United , before their winning streak in the league was ended with a 3 – 0 defeat away to Newport County . York drew 2 – 2 away to Chesterfield , having taken a two @-@ goal lead through O \'Neill and Jarvis , before the home team fought back through Armand <unk> and Jay O <unk> . The team then hosted Fleetwood Town , and the visitors won 2 – 0 with goals scored in each half by Gareth Evans and <unk> Matt . Scunthorpe United were beaten 4 – 1 at home to end York \'s three @-@ match run without a win , with all the team \'s goals coming in the first half from Carson , Fletcher and Brobbel ( 2 ) . \n \n = = = November = = = \n \n Bowman scored his first goals for York away to Cheltenham Town , as York twice fought back from behind to draw 2 – 2 . York drew 3 – 3 away to Bristol Rovers to earn a first round replay in the FA Cup , taking the lead through Jarvis before Eliot Richards equalised for the home team . Carson scored a 30 yard volley to put York back in the lead , and after Bristol Rovers goals from Matt <unk> and Chris <unk> , Fletcher scored an 86th @-@ minute equaliser for York . Bowman scored with a header from an O \'Neill cross to open the scoring at home to Plymouth Argyle , which was the first goal the visitors had conceded in 500 minutes of action . However , Plymouth equalised 11 minutes later through <unk> <unk> and the match finished a 1 – 1 draw . York were knocked out of the FA Cup after losing 3 – 2 at home to Bristol Rovers in a first round replay ; the visitors were 3 – 0 up by 50 @-@ minutes before Fletcher pulled two back for York with a penalty and a long @-@ range strike . 
\n Defender Keith Lowe , of Cheltenham , and goalkeeper Nick Pope , of Charlton Athletic , were signed on loan until January 2014 . They both played in York \'s first league defeat in four weeks , 2 – 1 away , to Southend United . <unk> <unk> gave Southend the lead early into the match and Bowman equalised for York with a low strike during the second half , before Luke Prosser scored the winning goal for the home side in stoppage time . With Pope preferred in goal , <unk> returned to Blackpool on his own accord , although his loan agreement would stay in place until January 2014 . York then drew 0 – 0 away to Morecambe . After Pope was recalled from his loan by Charlton , York signed Wolverhampton Wanderers goalkeeper Aaron McCarey on loan until January 2014 . McCarey kept a clean sheet in York \'s 0 – 0 home draw with Rochdale . \n \n = = = December = = = \n \n Cresswell retired from playing as a result of an eye complaint and a knee injury . York drew 1 – 1 away to Burton Albion , with an own goal scored by Shane <unk> @-@ <unk> giving York the lead in the 64th @-@ minute before the home team equalised eight minutes later through Billy <unk> . Atkinson was released after failing to force himself into the first team and signed for Scarborough Athletic , with whom he had been on loan . York drew 0 – 0 at home with second @-@ placed Oxford United , in which Carson came closest to scoring with a volley that <unk> across the face of the goal . This was followed by another draw after the match away to Accrington Stanley finished 1 – 1 , with the home team <unk> 10 minutes after a Fletcher penalty had given York the lead in the 35th @-@ minute . Striker <unk> McDonald , who had been released by Peterborough United , was signed on a contract until the end of the season . York \'s last match of 2013 was a 2 – 1 defeat away at Bury , a result that ended York \'s run of consecutive draws at five . 
The home team were 2 – 0 up by the 19th @-@ minute , before Michael Coulson scored York \'s goal in the 73rd @-@ minute . This result meant York would begin 2014 in 22nd @-@ position in the table , only out of the relegation zone on goal difference . \n \n = = = January = = = \n \n Jarvis scored the only goal in York \'s first win since October 2013 , a 1 – 0 home victory over Morecambe on New Year \'s Day . McCarey was recalled by Wolverhampton Wanderers due to an injury to one of their <unk> , while O \'Neill was recalled by Burnley to take part in their FA Cup match . York achieved back @-@ to @-@ back wins for the first time since October 2013 after Dagenham & Redbridge were beaten 3 – 1 at home , with Bowman opening the scoring in the second half before Fletcher scored twice . Adam Reed , who had a spell on loan with York in the previous season , was signed on a contract until the end of the season after parting company with Burton . Davies \' loan was extended , while Brobbel and <unk> returned to their parent clubs . Cheltenham club captain Russell Penn , a midfielder , was signed on a two @-@ and @-@ a @-@ half @-@ year contract for an undisclosed fee . Lowe was subsequently signed permanently from Cheltenham on a two @-@ and @-@ a @-@ half @-@ year contract for an undisclosed fee . Having been allowed to leave the club on a free transfer , Ashley Chambers signed for Conference Premier club Cambridge United . \n York achieved three successive wins for the first time in 2013 – 14 after beating Northampton 2 – 0 away , with Bowman and Fletcher scoring in three @-@ second half minutes . Defender John McCombe was signed on a two @-@ and @-@ a @-@ half @-@ year contract following his release from Mansfield , before Clay and Jamal <unk> left York by mutual consent . Pope returned to York on loan from Charlton for the remainder of the season . 
York \'s run of wins ended with a 0 – 0 draw at home to Bristol Rovers , before their first defeat of the year came after losing 2 – 0 away to Hartlepool . Preston winger Will Hayhurst , a Republic of Ireland under @-@ 21 international , was signed on a one @-@ month loan . York fell to a successive defeat for the first time since September 2013 after being beaten 2 – 0 at home by Chesterfield . Shortly after the match , Smith left the club by mutual consent to pursue first @-@ team football . \n \n = = = February = = = \n \n Fletcher scored a 90th @-@ minute winner for York away to Fleetwood in a 2 – 1 win , a result that ended Fleetwood \'s five @-@ match unbeaten run . York then drew 0 – 0 at home to fellow mid @-@ table team Cheltenham , before beating Plymouth 4 – 0 away with goals from Fletcher , McCombe ( 2 ) and Carson as the team achieved successive away wins for the first time in 2013 – 14 . York went without scoring for a fourth consecutive home match after drawing 0 – 0 with Southend . Having worn the <unk> since an injury to McGurk , Penn was appointed captain for the rest of the season , a position that had earlier been held by Smith and Parslow . \n \n = = = March = = = \n \n York achieved their first home win in five matches after beating Exeter 2 – 1 , with first half goals scored by McCombe and Coulson . Hayhurst \'s loan was extended to the end of the season , having impressed in his six appearances for the club . Coulson scored again with the only goal , a 41st @-@ minute header , in York \'s 1 – 0 away win over AFC Wimbledon . Bowman scored the only goal with a 32nd @-@ minute penalty as York won 1 – 0 away against Mansfield , in which Fletcher missed the opportunity to extend the lead when his stoppage time penalty was saved by Alan Marriott . York moved one place outside the play @-@ offs with a 2 – 0 home win over Wycombe , courtesy of a second Bowman penalty in as many matches and a Carson goal from the edge of the penalty area . 
Coulson scored York \'s only goal in a 1 – 0 away win over struggling Portsmouth with a low volley in the fifth @-@ minute ; this result meant York moved into the play @-@ offs in seventh @-@ place with eight fixtures remaining . \n Striker Calvin Andrew , who had been released by Mansfield in January 2014 , was signed on a contract for the remainder of the season . He made his debut as a substitute in York \'s 1 – 0 home win over bottom of the table Torquay , in which Hayhurst scored the only goal in the 11th @-@ minute with an 18 yard shot that <unk> off Aaron <unk> . Middlesbrough winger Brobbel rejoined on loan until the end of the season , following an injury to Carson . York \'s run of successive wins ended on six matches after a 0 – 0 home draw with Burton , and this result saw York drop out of the play @-@ offs in eighth @-@ place . With the team recording six wins and one draw in March 2014 , including six clean sheets , Worthington was named League Two Manager of the Month . \n \n = = = April = = = \n \n Pope made a number of saves as York held league leaders Rochdale to a 0 – 0 away draw , with a point being enough to lift the team back into seventh @-@ place . York were prevented from equalling a club record of eight consecutive clean sheets when Accrington scored a stoppage time equaliser in a 1 – 1 home draw , in which York had taken earlier taken the lead with a Coulson penalty . A 1 – 0 win away win over Oxford , which was decided by a second half Coulson penalty , resulted in York moving one place above their opponents and back into seventh @-@ place . York consolidated their place in a play @-@ off position after beating Bury 1 – 0 at home with a fifth @-@ minute goal scored by Lowe from a Hayhurst corner . The result meant York opened up a five @-@ point lead over eighth @-@ placed Oxford with two fixtures remaining . 
A place in the League Two play @-@ offs was secured following a 1 – 0 win over Newport at home , in which Coulson scored the only goal in the 77th @-@ minute with a 25 yard free kick . Pope earned a nomination for League Two Player of the Month for April 2014 , having conceded only one goal in five matches in that period . \n \n = = = May = = = \n \n The league season concluded with an away match against divisional runners @-@ up Scunthorpe ; having gone two goals down York fought back to draw 2 – 2 with goals scored by Brobbel and Andrew . This result meant York finished the season in seventh @-@ place in League Two , and would thus play fourth @-@ placed Fleetwood in the play @-@ off semi @-@ final on the back of a 17 @-@ match unbeaten run . York lost 1 – 0 to Fleetwood in the first leg at <unk> Crescent ; the goal came from former York player <unk> Blair in the 50th @-@ minute , who scored from close range after Antoni <unk> \'s shot was blocked on the line . A 0 – 0 draw away to Fleetwood in the second leg meant York were eliminated 1 – 0 on aggregate , ending the prospect of a second promotion in three seasons . At an awards night held at York Racecourse , Oyebanjo was voted <unk> of the Year for 2013 – 14 . \n \n = = Summary and aftermath = = \n \n York mostly occupied the bottom half of the table before the turn of the year , and dropped as low as 23rd in September 2013 . During February 2014 the team broke into the top half of the table and with one match left were in sixth @-@ place . York \'s defensive record was the third best in League Two with 41 goals conceded , bettered only by Southend ( 39 ) and Chesterfield ( 40 ) . Davies made the highest number of appearances over the season , appearing in 47 of York \'s 52 matches . Fletcher was York \'s top scorer in the league and in all competitions , with 10 league goals and 13 in total . He was the only player to reach double figures , and was followed by Jarvis with nine goals . 
\n After the season ended York released Tom Allan , Andrew , Dickinson , McDonald , Puri and Reed , while McGurk retired from professional football . Bowman and Oyebanjo left to sign for Torquay and Crawley Town respectively while Coulson signed a new contract with the club . York \'s summer signings included goalkeeper Jason <unk> from Tranmere Rovers , defenders <unk> <unk> from Dagenham , Marvin McCoy from Wycombe and Dave Winfield from Shrewsbury Town , midfielders <unk> <unk> from Mansfield , Anthony <unk> from Southend and Luke <unk> from Shrewsbury and striker Jake Hyde from <unk> . \n \n = = Match details = = \n \n League positions are sourced by <unk> , while the remaining information is referenced individually . \n \n = = = Football League Two = = = \n \n \n = = = League table ( part ) = = = \n \n \n = = = FA Cup = = = \n \n \n = = = League Cup = = = \n \n \n = = = Football League Trophy = = = \n \n \n = = = Football League Two play @-@ offs = = = \n \n \n = = <unk> = = \n \n \n = = = In = = = \n \n <unk> around club names denote the player \'s contract with that club had expired before he joined York . \n \n = = = Out = = = \n \n <unk> around club names denote the player joined that club after his York contract expired . \n \n = = = Loan in = = = \n \n \n = = = Loan out = = = \n \n \n = = Appearances and goals = = \n \n Source : \n Numbers in parentheses denote appearances as substitute . \n Players with names struck through and marked left the club during the playing season . \n Players with names in italics and marked * were on loan from another club for the whole of their season with York . \n Players listed with no appearances have been in the <unk> squad but only as unused <unk> . \n Key to positions : <unk> – <unk> ; <unk> – Defender ; <unk> – <unk> ; <unk> – Forward \n \n"

From texts to numbers

A model can't train on raw texts like the one above; we need to convert them into numbers first. This takes two steps: transforming each text into a list of words (called tokens), then transforming those tokens into numbers. These steps are usually called tokenization and numericalization in NLP.

Tokenization

Tokenizing a text means converting it into a list of meaningful tokens. There are several ways to do this:

  • character-level tokenization just splits the texts into an array of characters
  • word-level tokenization splits the texts by words or punctuation symbols
  • subword-level tokenization splits the texts by subwords (particularly useful in languages like Turkish or German where you can build longer words by adding prefixes or suffixes)

While character-level tokenization is pretty straightforward, the two other kinds are a bit trickier. How do you split a word like "don't" for instance, which is actually "do not"? In our case, we don't have to worry about that since wikitext-2 has been pre-tokenized, so we can just split on spaces. (Alternatively, we could train a BPE tokenizer on those texts.)
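As a quick comparison of the first two options, here is a sketch on a made-up sentence written in the same pre-tokenized style as the dataset (punctuation already separated by spaces). The names are illustrative only.

```swift
import Foundation

let sampleSentence = "York won 1 – 0 at home ."

// Character-level: one token per character, spaces included.
let characterTokens = sampleSentence.map { String($0) }

// Word-level on pre-tokenized text: splitting on spaces is enough.
let wordTokens = sampleSentence.components(separatedBy: " ")

print(characterTokens.count, wordTokens.count)
print(wordTokens)
```

Character-level tokenization produces many more tokens for the same text, while the word-level split relies on the text having been pre-tokenized.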


In [8]:
func easyTokenize(_ text: String) -> [String] {
    return text.components(separatedBy: " ")
}

In [9]:
let trainTokenizedTexts = trainTexts.map(easyTokenize)
let validTokenizedTexts = validTexts.map(easyTokenize)

Numericalization

Once our texts are split into tokens, we can build a mapping from each token to a unique index and convert the tokens into numbers. We usually try to limit the size of the vocabulary by keeping only the most common tokens, or by removing the tokens that appear fewer than a given number of times. All tokens that are not part of the vocabulary will be changed to <unk> (for unknown).

So first, let's count how many times each token is used in our texts. We also save the length of each text since we will need that later on.


In [10]:
func countTokens(_ texts: [[String]]) -> ([Int], [String:Int]) {
    var counts: [String:Int] = [:]
    var lengths: [Int] = []
    for tokens in texts {
        lengths.append(tokens.count)
        for token in tokens {
            counts[token] = (counts[token] ?? 0) + 1
        }
    }
    return (lengths,counts)
}

We only use the training set to build our vocabulary.


In [11]:
let (trainLengths, trainCounts) = countTokens(trainTokenizedTexts)

The following function then creates a vocabulary containing the most frequent tokens, up to maxCount entries (special tokens included), keeping only tokens that appear at least minFrequency times (NB: a language model can barely learn anything about words rarely present in the dataset). We return a tuple with the two mappings, int to string and string to int (often called itos and stoi in NLP).


In [12]:
func makeVocabulary(
    _ counts: [String:Int],
    minFrequency: Int = 2,
    maxCount: Int = 60000)
-> (itos: [Int:String], stoi: [String:Int]) {
    // Drop the special tokens from the counts; they get fixed indices below.
    let withoutSpec = counts.filter { $0.0 != "xxunk" && $0.0 != "xxpad" }
    // Most frequent tokens first.
    let sorted = withoutSpec.sorted { $0.1 > $1.1 }
    var itos: [Int:String] = [0:"xxunk", 1:"xxpad"]
    var stoi: [String:Int] = ["xxunk":0, "xxpad":1]
    for (i,x) in sorted.enumerated() {
        // Stop once the vocabulary is full or the remaining tokens are too rare.
        if i+2 >= maxCount || x.1 < minFrequency { break }
        itos[i+2] = (x.0)
        stoi[x.0] = i+2
    }
    return (itos: itos, stoi: stoi)
}

Let's use our previous counts to build a vocabulary:


In [13]:
let vocabulary = makeVocabulary(trainCounts)

We can then use it to numericalize our tokenized texts. First, let's check the index of the unknown token, so that we can use it for words that are out of vocabulary.


In [14]:
vocabulary.stoi["<unk>"]


Out[14]:
▿ Optional<Int>
  - some : 6

In [15]:
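func numericalize(_ tokens: [String], with stoi: [String:Int]) -> [Int] {
    // Look up the unknown token's index instead of hard-coding it (6 for this vocabulary).
    let unknownIndex = stoi["<unk>"] ?? stoi["xxunk"] ?? 0
    return tokens.map { stoi[$0] ?? unknownIndex }
}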

And we can apply it to all our tokenized texts:


In [16]:
let trainNumericalizedTexts = trainTokenizedTexts.map{ numericalize($0, with: vocabulary.stoi) }
let validNumericalizedTexts = validTokenizedTexts.map{ numericalize($0, with: vocabulary.stoi) }
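To make the whole numericalization pipeline concrete, here is a self-contained sketch on a toy corpus. It is a simplified variant of the steps above, not the notebook's exact code: it uses an array for itos, reserves index 0 for the unknown token, and breaks frequency ties alphabetically so the result is deterministic. All names are illustrative.

```swift
// Toy corpus, already tokenized.
let toyCorpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "bird"]]

// Count how many times each token appears.
var toyCounts: [String: Int] = [:]
for token in toyCorpus.flatMap({ $0 }) { toyCounts[token, default: 0] += 1 }

// Keep tokens seen at least twice, most frequent first (ties broken alphabetically).
let keptTokens = toyCounts.filter { $0.value >= 2 }
    .sorted { $0.value != $1.value ? $0.value > $1.value : $0.key < $1.key }
    .map { $0.key }

// Index 0 is reserved for the unknown token in this sketch.
let toyItos = ["<unk>"] + keptTokens
let toyStoi = Dictionary(
    uniqueKeysWithValues: toyItos.enumerated().map { ($0.element, $0.offset) })

// Encode a text, then decode it back; rare tokens collapse to <unk>.
let toyEncoded = ["the", "bird", "sat"].map { toyStoi[$0] ?? 0 }
let toyDecoded = toyEncoded.map { toyItos[$0] }
print(toyEncoded)  // [1, 0, 2]
print(toyDecoded)  // ["the", "<unk>", "sat"]
```

The round trip is lossy by design: "bird" appears only once in the toy corpus, so it is replaced by the unknown token.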

Understanding the language model dataset

The task of a language model is to guess the next word in a stream of text. Given a list of tokenized and numericalized texts, we usually concatenate them all into one big stream, split that stream into batchSize chunks of contiguous text (one chunk per row of a batch), then read through those chunks sequenceLength tokens at a time.

Let's look at an example:


In [17]:
let items = [[0,1,2,3,4],[5,6,7,8,9,10],[11,12,13,14,15,16,17,18],[19,20],[21,22]]

In [18]:
var dataset = LanguageModelDataset(batchSize: 4, sequenceLength: 3, numericalizedTexts: items)

Here our stream is the sequence of integers from 0 to 22. With a batchSize of 4, we split it into four chunks, which are:

0,1,2,3,4
5,6,7,8,9
10,11,12,13,14
15,16,17,18,19

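The last three elements of the stream (20, 21 and 22) don't fit into any chunk, because its 23 elements don't divide evenly into 4 chunks. They are never used as inputs, although 20 still serves as the target that follows 19.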

Then, when read with a sequenceLength of 3, the first batch has as input

0,1,2
5,6,7
10,11,12
15,16,17

and for target the next words:

1,2,3
6,7,8
11,12,13
16,17,18
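The chunking just described can be sketched in plain Swift. This is an illustration of the idea, not LanguageModelDataset's actual implementation; the function name is hypothetical, and the chunk length formula (one token held back so that the last input has a target) is an assumption chosen to match the example.

```swift
// Concatenate the texts, cut the stream into batchSize chunks, then read
// the chunks sequenceLength tokens at a time, pairing inputs with next tokens.
func sketchLanguageModelBatches(
    texts: [[Int]], batchSize: Int, sequenceLength: Int
) -> [(data: [[Int]], label: [[Int]])] {
    let stream = texts.flatMap { $0 }
    // Hold back one token so that the last input still has a target.
    let chunkLength = (stream.count - 1) / batchSize
    var batches: [(data: [[Int]], label: [[Int]])] = []
    var start = 0
    while start < chunkLength {
        // The last window of a chunk may be shorter than sequenceLength.
        let end = min(start + sequenceLength, chunkLength)
        var data: [[Int]] = []
        var label: [[Int]] = []
        for row in 0..<batchSize {
            let offset = row * chunkLength
            data.append(Array(stream[(offset + start)..<(offset + end)]))
            // Targets are the same window shifted by one token.
            label.append(Array(stream[(offset + start + 1)..<(offset + end + 1)]))
        }
        batches.append((data: data, label: label))
        start = end
    }
    return batches
}

let sketchItems = [[0,1,2,3,4],[5,6,7,8,9,10],[11,12,13,14,15,16,17,18],[19,20],[21,22]]
let sketchBatches = sketchLanguageModelBatches(
    texts: sketchItems, batchSize: 4, sequenceLength: 3)
for batch in sketchBatches { print(batch) }
```

With 23 tokens and a batch size of 4, each chunk holds 5 tokens, so the sketch yields one batch of width 3 followed by one of width 2.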

Let's put our dataset in batches to check it does all of this for us:


In [19]:
let inBatches = dataset.inBatches(of: 4).lazy.map {
  (
    data: Tensor<Int32>($0.map(\.first)),
    label: Tensor<Int32>($0.map(\.second))
  )
}

In [20]:
for x in inBatches { print(x) }


(data: [[ 0,  1,  2],
 [ 5,  6,  7],
 [10, 11, 12],
 [15, 16, 17]], label: [[ 1,  2,  3],
 [ 6,  7,  8],
 [11, 12, 13],
 [16, 17, 18]])
(data: [[ 3,  4],
 [ 8,  9],
 [13, 14],
 [18, 19]], label: [[ 4,  5],
 [ 9, 10],
 [14, 15],
 [19, 20]])

The first batch is as expected, and the second one has only a sequence length of 2 because our big chunks of text have a length of 5 here.

Behind the scenes, LanguageModelDataset implements a new collection which has the proper length and subscript, so that it returns the input/target pairs of text (and not the raw texts of varying lengths).

When shuffling is enabled, the texts are shuffled before being concatenated to form the stream.


In [21]:
dataset.shuffle()

let inBatches = dataset.inBatches(of: 4).lazy.map {
  (
    data: Tensor<Int32>($0.map(\.first)),
    label: Tensor<Int32>($0.map(\.second))
  )
}

In [22]:
for x in inBatches { print(x) }


(data: [[19, 20, 11],
 [14, 15, 16],
 [21, 22,  5],
 [ 8,  9, 10]], label: [[20, 11, 12],
 [15, 16, 17],
 [22,  5,  6],
 [ 9, 10,  0]])
(data: [[12, 13],
 [17, 18],
 [ 6,  7],
 [ 0,  1]], label: [[13, 14],
 [18, 21],
 [ 7,  8],
 [ 1,  2]])
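The output above is consistent with the texts having been reordered as [19,20], [11,...,18], [21,22], [5,...,10], [0,...,4] before concatenation: whole texts move around, but the tokens inside each text stay in order. A plain-Swift sketch of that idea (not the library's code, and with a different random order on every run):

```swift
let shuffleItems = [[0,1,2,3,4],[5,6,7,8,9,10],[11,12,13,14,15,16,17,18],[19,20],[21,22]]

// Shuffle whole texts first, then concatenate them into a single stream.
let shuffledStream = shuffleItems.shuffled().flatMap { $0 }
print(shuffledStream)
```

The stream always contains the same tokens; only the order of the texts changes between runs.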

Applying it to our texts

We can create a LanguageModelDataset from all our texts. Since it needs the length of every sample to work, we can provide the array of lengths of each text to speed up the init (if we don't, it will make a pass over the dataset to compute them).


In [23]:
let trainSet = LanguageModelDataset(
    batchSize: 64, 
    sequenceLength: 72, 
    numericalizedTexts: trainNumericalizedTexts, 
    lengths: trainLengths
)

In [24]:
let validSet = LanguageModelDataset(
    batchSize: 64, 
    sequenceLength: 72, 
    numericalizedTexts: validNumericalizedTexts
)

We can then batch our samples with the Epochs API. Let's start by defining some type aliases for convenience.


In [25]:
typealias Samples = LanguageModelDataset<[[Int]]>
typealias LabeledTextBatch = (data: Tensor<Int32>, label: Tensor<Int32>)
typealias Batches = Slices<Sampling<Samples, ArraySlice<Int>>>
typealias Training = LazyMapSequence<
    TrainingEpochs<Samples, SystemRandomNumberGenerator>, 
    LazyMapSequence<Batches, LabeledTextBatch>
  >
typealias Validation = LazyMapSequence<
    Slices<Samples>, 
    LabeledTextBatch
  >

In [26]:
let training: Training = TrainingEpochs(
  samples: trainSet, 
  batchSize: 64, 
  entropy: SystemRandomNumberGenerator()
).lazy.map { (batches: Batches) -> LazyMapSequence<Batches, LabeledTextBatch> in
  batches.lazy.map {
    (
      data: Tensor<Int32>($0.map(\.first)),
      label: Tensor<Int32>($0.map(\.second))
    )
  }
}

In [27]:
let validation: Validation = validSet.inBatches(of: 64).lazy.map {
  (
    data: Tensor<Int32>($0.map(\.first)),
    label: Tensor<Int32>($0.map(\.second))
  )
}

To iterate through our training epochs, we just use .enumerated(), which gives us the epoch index along with that epoch's batches. Here, let's check that we read through the texts in order by storing the first five batches.


In [28]:
var sampleTrainingBatches: [LabeledTextBatch] = []
for (epoch, epochBatches) in training.prefix(1).enumerated() {
  for batch in epochBatches {
    sampleTrainingBatches.append(batch)
    if sampleTrainingBatches.count >= 5 {
      break
    }
  }
}

Iterating through our validation batches is even easier:


In [29]:
var sampleValidationBatches: [LabeledTextBatch] = []
for batch in validation {
  sampleValidationBatches.append(batch)
  if sampleValidationBatches.count >= 5 {
    break
  }
}

To show one of the lines of our tensor, we will use this function:


In [30]:
func showText(_ x: Tensor<Int32>) -> String {
    // Map each index back to its token using the vocabulary built earlier.
    let tokens = x.scalars.map { vocabulary.itos[Int($0)]! }
    return tokens.joined(separator: " ")
}

Now let's look at the first row of our first batch:


In [31]:
showText(sampleTrainingBatches[0].data[0])


Out[31]:
"The US hoped the generals would <unk> because they could not survive and be able to repel the communists or rival officers without aid from Washington . On the other hand , Khánh and the Young Turks expected the Americans would become more worried about the communist gains first and <unk> to their fait <unk> against the HNC . The generals were correct . \n The South Vietnamese eventually had their way"

The targets are just shifted one word to the right:


In [32]:
showText(sampleTrainingBatches[0].label[0])


Out[32]:
"US hoped the generals would <unk> because they could not survive and be able to repel the communists or rival officers without aid from Washington . On the other hand , Khánh and the Young Turks expected the Americans would become more worried about the communist gains first and <unk> to their fait <unk> against the HNC . The generals were correct . \n The South Vietnamese eventually had their way ."

Since the validation set is not shuffled, we should be able to pick a pair of texts where the second one starts exactly where the first one stopped:


In [33]:
showText(sampleValidationBatches[0].data[1])


Out[33]:
"\n \n In 2011 , the <unk> utilized a committee of running backs , with <unk> , Daniel Porter , and Jerome <unk> all receiving significant playing time . <unk> was used mostly in short @-@ yardage situations on the ground , while also being active as a receiver and on special teams . He played in 18 games , made eight starts , and finished with <unk> yards on 52 carries"

In [34]:
showText(sampleValidationBatches[1].data[1])


Out[34]:
"with no touchdowns . He also caught 22 passes for 150 yards and a touchdown . <unk> played in both of the <unk> \' playoff games . In the West <unk> @-@ Finals against the <unk> , he rushed for a goal @-@ line touchdown , in addition to making three receptions and two special @-@ teams tackles . <unk> played a more limited role in the West Finals against the BC"