Building a corpus from individual files

Until now we've used single comma-delimited and tab-delimited files as our source of data. For this project we'll look at 2,000 individual files where each file contains the text of a review. The labels are determined by the subdirectory that holds the file; that is, positive reviews are stored in a \pos\ directory while negative reviews live under \neg\. Refer to moviereviesREADME.txt for more information about the files.

We'll show two different methods to extract the text of each file in each directory, and build our labeled corpus:

  • using Python's os module to build a pandas DataFrame
  • using an nltk tool called CategorizedPlaintextCorpusReader

Using Python's os module to build a DataFrame


In [1]:
# Perform imports:
import numpy as np
import pandas as pd
import os

Let's look at what os.walk() does:


In [14]:
gen = os.walk('../moviereviews')
next(gen)


Out[14]:
('../moviereviews - Copy', ['neg', 'pos'], ['poldata.README.2.0'])

os.walk() returns a generator that yields one tuple for each folder it visits. Each tuple has three items:

  1. the name of the current folder
  2. a list of names of any subfolders
  3. a list of names of any files in the current folder
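To see these three-item tuples without depending on the review files, here is a minimal sketch that builds a small throwaway directory tree and walks it. The folder and file names (README, example.txt) are made up for illustration and are not part of the movie review data.

```python
import os
import tempfile

# Build a tiny, hypothetical tree: root/README plus root/neg and root/pos
with tempfile.TemporaryDirectory() as root:
    for sub in ('neg', 'pos'):
        os.makedirs(os.path.join(root, sub))
        with open(os.path.join(root, sub, 'example.txt'), 'w') as f:
            f.write('sample text')
    with open(os.path.join(root, 'README'), 'w') as f:
        f.write('about')

    # Each iteration yields (current folder, subfolder names, file names)
    walked = []
    for folder, subfolders, filenames in os.walk(root):
        walked.append((os.path.relpath(folder, root), sorted(subfolders), sorted(filenames)))

walked.sort()  # the order in which subfolders are visited can vary, so sort for a stable result
print(walked)
```

The first tuple describes the root folder (shown as '.'), and the remaining tuples describe the neg and pos subfolders, each with an empty subfolder list.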

In [15]:
next(gen)


Out[15]:
('../moviereviews - Copy\\neg',
 [],
 ['cv000_29416.txt',
  'cv001_19502.txt',
  'cv002_17424.txt',
  'cv003_12683.txt',
  'cv004_12641.txt',
  'cv005_29357.txt',
  'cv006_17022.txt',
  'cv007_4992.txt',
  'cv008_29326.txt',
  'cv009_29417.txt',
  'cv010_29063.txt',
  'cv011_13044.txt',
  'cv012_29411.txt',
  'cv013_10494.txt',
  'cv014_15600.txt',
  'cv015_29356.txt',
  'cv016_4348.txt',
  'cv017_23487.txt',
  'cv018_21672.txt',
  'cv019_16117.txt',
  'cv020_9234.txt',
  'cv021_17313.txt',
  'cv022_14227.txt',
  'cv023_13847.txt',
  'cv024_7033.txt',
  'cv025_29825.txt',
  'cv026_29229.txt',
  'cv027_26270.txt',
  'cv028_26964.txt',
  'cv029_19943.txt',
  'cv030_22893.txt',
  'cv031_19540.txt',
  'cv032_23718.txt',
  'cv033_25680.txt',
  'cv034_29446.txt',
  'cv035_3343.txt',
  'cv036_18385.txt',
  'cv037_19798.txt',
  'cv038_9781.txt',
  'cv039_5963.txt',
  'cv040_8829.txt',
  'cv041_22364.txt',
  'cv042_11927.txt',
  'cv043_16808.txt',
  'cv044_18429.txt',
  'cv045_25077.txt',
  'cv046_10613.txt',
  'cv047_18725.txt',
  'cv048_18380.txt',
  'cv049_21917.txt',
  'cv050_12128.txt',
  'cv051_10751.txt',
  'cv052_29318.txt',
  'cv053_23117.txt',
  'cv054_4101.txt',
  'cv055_8926.txt',
  'cv056_14663.txt',
  'cv057_7962.txt',
  'cv058_8469.txt',
  'cv059_28723.txt',
  'cv060_11754.txt',
  'cv061_9321.txt',
  'cv062_24556.txt',
  'cv063_28852.txt',
  'cv064_25842.txt',
  'cv065_16909.txt',
  'cv066_11668.txt',
  'cv067_21192.txt',
  'cv068_14810.txt',
  'cv069_11613.txt',
  'cv070_13249.txt',
  'cv071_12969.txt',
  'cv072_5928.txt',
  'cv073_23039.txt',
  'cv074_7188.txt',
  'cv075_6250.txt',
  'cv076_26009.txt',
  'cv077_23172.txt',
  'cv078_16506.txt',
  'cv079_12766.txt',
  'cv080_14899.txt',
  'cv081_18241.txt',
  'cv082_11979.txt',
  'cv083_25491.txt',
  'cv084_15183.txt',
  'cv085_15286.txt',
  'cv086_19488.txt',
  'cv087_2145.txt',
  'cv088_25274.txt',
  'cv089_12222.txt',
  'cv090_0049.txt',
  'cv091_7899.txt',
  'cv092_27987.txt',
  'cv093_15606.txt',
  'cv094_27868.txt',
  'cv095_28730.txt',
  'cv096_12262.txt',
  'cv097_26081.txt',
  'cv098_17021.txt',
  'cv099_11189.txt',
  'cv100_12406.txt',
  'cv101_10537.txt',
  'cv102_8306.txt',
  'cv103_11943.txt',
  'cv104_19176.txt',
  'cv105_19135.txt',
  'cv106_18379.txt',
  'cv107_25639.txt',
  'cv108_17064.txt',
  'cv109_22599.txt',
  'cv110_27832.txt',
  'cv111_12253.txt',
  'cv112_12178.txt',
  'cv113_24354.txt',
  'cv114_19501.txt',
  'cv115_26443.txt',
  'cv116_28734.txt',
  'cv117_25625.txt',
  'cv118_28837.txt',
  'cv119_9909.txt',
  'cv120_3793.txt',
  'cv121_18621.txt',
  'cv122_7891.txt',
  'cv123_12165.txt',
  'cv124_3903.txt',
  'cv125_9636.txt',
  'cv126_28821.txt',
  'cv127_16451.txt',
  'cv128_29444.txt',
  'cv129_18373.txt',
  'cv130_18521.txt',
  'cv131_11568.txt',
  'cv132_5423.txt',
  'cv133_18065.txt',
  'cv134_23300.txt',
  'cv135_12506.txt',
  'cv136_12384.txt',
  'cv137_17020.txt',
  'cv138_13903.txt',
  'cv139_14236.txt',
  'cv140_7963.txt',
  'cv141_17179.txt',
  'cv142_23657.txt',
  'cv143_21158.txt',
  'cv144_5010.txt',
  'cv145_12239.txt',
  'cv146_19587.txt',
  'cv147_22625.txt',
  'cv148_18084.txt',
  'cv149_17084.txt',
  'cv150_14279.txt',
  'cv151_17231.txt',
  'cv152_9052.txt',
  'cv153_11607.txt',
  'cv154_9562.txt',
  'cv155_7845.txt',
  'cv156_11119.txt',
  'cv157_29302.txt',
  'cv158_10914.txt',
  'cv159_29374.txt',
  'cv160_10848.txt',
  'cv161_12224.txt',
  'cv162_10977.txt',
  'cv163_10110.txt',
  'cv164_23451.txt',
  'cv165_2389.txt',
  'cv166_11959.txt',
  'cv167_18094.txt',
  'cv168_7435.txt',
  'cv169_24973.txt',
  'cv170_29808.txt',
  'cv171_15164.txt',
  'cv172_12037.txt',
  'cv173_4295.txt',
  'cv174_9735.txt',
  'cv175_7375.txt',
  'cv176_14196.txt',
  'cv177_10904.txt',
  'cv178_14380.txt',
  'cv179_9533.txt',
  'cv180_17823.txt',
  'cv181_16083.txt',
  'cv182_7791.txt',
  'cv183_19826.txt',
  'cv184_26935.txt',
  'cv185_28372.txt',
  'cv186_2396.txt',
  'cv187_14112.txt',
  'cv188_20687.txt',
  'cv189_24248.txt',
  'cv190_27176.txt',
  'cv191_29539.txt',
  'cv192_16079.txt',
  'cv193_5393.txt',
  'cv194_12855.txt',
  'cv195_16146.txt',
  'cv196_28898.txt',
  'cv197_29271.txt',
  'cv198_19313.txt',
  'cv199_9721.txt',
  'cv200_29006.txt',
  'cv201_7421.txt',
  'cv202_11382.txt',
  'cv203_19052.txt',
  'cv204_8930.txt',
  'cv205_9676.txt',
  'cv206_15893.txt',
  'cv207_29141.txt',
  'cv208_9475.txt',
  'cv209_28973.txt',
  'cv210_9557.txt',
  'cv211_9955.txt',
  'cv212_10054.txt',
  'cv213_20300.txt',
  'cv214_13285.txt',
  'cv215_23246.txt',
  'cv216_20165.txt',
  'cv217_28707.txt',
  'cv218_25651.txt',
  'cv219_19874.txt',
  'cv220_28906.txt',
  'cv221_27081.txt',
  'cv222_18720.txt',
  'cv223_28923.txt',
  'cv224_18875.txt',
  'cv225_29083.txt',
  'cv226_26692.txt',
  'cv227_25406.txt',
  'cv228_5644.txt',
  'cv229_15200.txt',
  'cv230_7913.txt',
  'cv231_11028.txt',
  'cv232_16768.txt',
  'cv233_17614.txt',
  'cv234_22123.txt',
  'cv235_10704.txt',
  'cv236_12427.txt',
  'cv237_20635.txt',
  'cv238_14285.txt',
  'cv239_29828.txt',
  'cv240_15948.txt',
  'cv241_24602.txt',
  'cv242_11354.txt',
  'cv243_22164.txt',
  'cv244_22935.txt',
  'cv245_8938.txt',
  'cv246_28668.txt',
  'cv247_14668.txt',
  'cv248_15672.txt',
  'cv249_12674.txt',
  'cv250_26462.txt',
  'cv251_23901.txt',
  'cv252_24974.txt',
  'cv253_10190.txt',
  'cv254_5870.txt',
  'cv255_15267.txt',
  'cv256_16529.txt',
  'cv257_11856.txt',
  'cv258_5627.txt',
  'cv259_11827.txt',
  'cv260_15652.txt',
  'cv261_11855.txt',
  'cv262_13812.txt',
  'cv263_20693.txt',
  'cv264_14108.txt',
  'cv265_11625.txt',
  'cv266_26644.txt',
  'cv267_16618.txt',
  'cv268_20288.txt',
  'cv269_23018.txt',
  'cv270_5873.txt',
  'cv271_15364.txt',
  'cv272_20313.txt',
  'cv273_28961.txt',
  'cv274_26379.txt',
  'cv275_28725.txt',
  'cv276_17126.txt',
  'cv277_20467.txt',
  'cv278_14533.txt',
  'cv279_19452.txt',
  'cv280_8651.txt',
  'cv281_24711.txt',
  'cv282_6833.txt',
  'cv283_11963.txt',
  'cv284_20530.txt',
  'cv285_18186.txt',
  'cv286_26156.txt',
  'cv287_17410.txt',
  'cv288_20212.txt',
  'cv289_6239.txt',
  'cv290_11981.txt',
  'cv291_26844.txt',
  'cv292_7804.txt',
  'cv293_29731.txt',
  'cv294_12695.txt',
  'cv295_17060.txt',
  'cv296_13146.txt',
  'cv297_10104.txt',
  'cv298_24487.txt',
  'cv299_17950.txt',
  'cv300_23302.txt',
  'cv301_13010.txt',
  'cv302_26481.txt',
  'cv303_27366.txt',
  'cv304_28489.txt',
  'cv305_9937.txt',
  'cv306_10859.txt',
  'cv307_26382.txt',
  'cv308_5079.txt',
  'cv309_23737.txt',
  'cv310_14568.txt',
  'cv311_17708.txt',
  'cv312_29308.txt',
  'cv313_19337.txt',
  'cv314_16095.txt',
  'cv315_12638.txt',
  'cv316_5972.txt',
  'cv317_25111.txt',
  'cv318_11146.txt',
  'cv319_16459.txt',
  'cv320_9693.txt',
  'cv321_14191.txt',
  'cv322_21820.txt',
  'cv323_29633.txt',
  'cv324_7502.txt',
  'cv325_18330.txt',
  'cv326_14777.txt',
  'cv327_21743.txt',
  'cv328_10908.txt',
  'cv329_29293.txt',
  'cv330_29675.txt',
  'cv331_8656.txt',
  'cv332_17997.txt',
  'cv333_9443.txt',
  'cv334_0074.txt',
  'cv335_16299.txt',
  'cv336_10363.txt',
  'cv337_29061.txt',
  'cv338_9183.txt',
  'cv339_22452.txt',
  'cv340_14776.txt',
  'cv341_25667.txt',
  'cv342_20917.txt',
  'cv343_10906.txt',
  'cv344_5376.txt',
  'cv345_9966.txt',
  'cv346_19198.txt',
  'cv347_14722.txt',
  'cv348_19207.txt',
  'cv349_15032.txt',
  'cv350_22139.txt',
  'cv351_17029.txt',
  'cv352_5414.txt',
  'cv353_19197.txt',
  'cv354_8573.txt',
  'cv355_18174.txt',
  'cv356_26170.txt',
  'cv357_14710.txt',
  'cv358_11557.txt',
  'cv359_6751.txt',
  'cv360_8927.txt',
  'cv361_28738.txt',
  'cv362_16985.txt',
  'cv363_29273.txt',
  'cv364_14254.txt',
  'cv365_12442.txt',
  'cv366_10709.txt',
  'cv367_24065.txt',
  'cv368_11090.txt',
  'cv369_14245.txt',
  'cv370_5338.txt',
  'cv371_8197.txt',
  'cv372_6654.txt',
  'cv373_21872.txt',
  'cv374_26455.txt',
  'cv375_9932.txt',
  'cv376_20883.txt',
  'cv377_8440.txt',
  'cv378_21982.txt',
  'cv379_23167.txt',
  'cv380_8164.txt',
  'cv381_21673.txt',
  'cv382_8393.txt',
  'cv383_14662.txt',
  'cv384_18536.txt',
  'cv385_29621.txt',
  'cv386_10229.txt',
  'cv387_12391.txt',
  'cv388_12810.txt',
  'cv389_9611.txt',
  'cv390_12187.txt',
  'cv391_11615.txt',
  'cv392_12238.txt',
  'cv393_29234.txt',
  'cv394_5311.txt',
  'cv395_11761.txt',
  'cv396_19127.txt',
  'cv397_28890.txt',
  'cv398_17047.txt',
  'cv399_28593.txt',
  'cv400_20631.txt',
  'cv401_13758.txt',
  'cv402_16097.txt',
  'cv403_6721.txt',
  'cv404_21805.txt',
  'cv405_21868.txt',
  'cv406_22199.txt',
  'cv407_23928.txt',
  'cv408_5367.txt',
  'cv409_29625.txt',
  'cv410_25624.txt',
  'cv411_16799.txt',
  'cv412_25254.txt',
  'cv413_7893.txt',
  'cv414_11161.txt',
  'cv415_23674.txt',
  'cv416_12048.txt',
  'cv417_14653.txt',
  'cv418_16562.txt',
  'cv419_14799.txt',
  'cv420_28631.txt',
  'cv421_9752.txt',
  'cv422_9632.txt',
  'cv423_12089.txt',
  'cv424_9268.txt',
  'cv425_8603.txt',
  'cv426_10976.txt',
  'cv427_11693.txt',
  'cv428_12202.txt',
  'cv429_7937.txt',
  'cv430_18662.txt',
  'cv431_7538.txt',
  'cv432_15873.txt',
  'cv433_10443.txt',
  'cv434_5641.txt',
  'cv435_24355.txt',
  'cv436_20564.txt',
  'cv437_24070.txt',
  'cv438_8500.txt',
  'cv439_17633.txt',
  'cv440_16891.txt',
  'cv441_15276.txt',
  'cv442_15499.txt',
  'cv443_22367.txt',
  'cv444_9975.txt',
  'cv445_26683.txt',
  'cv446_12209.txt',
  'cv447_27334.txt',
  'cv448_16409.txt',
  'cv449_9126.txt',
  'cv450_8319.txt',
  'cv451_11502.txt',
  'cv452_5179.txt',
  'cv453_10911.txt',
  'cv454_21961.txt',
  'cv455_28866.txt',
  'cv456_20370.txt',
  'cv457_19546.txt',
  'cv458_9000.txt',
  'cv459_21834.txt',
  'cv460_11723.txt',
  'cv461_21124.txt',
  'cv462_20788.txt',
  'cv463_10846.txt',
  'cv464_17076.txt',
  'cv465_23401.txt',
  'cv466_20092.txt',
  'cv467_26610.txt',
  'cv468_16844.txt',
  'cv469_21998.txt',
  'cv470_17444.txt',
  'cv471_18405.txt',
  'cv472_29140.txt',
  'cv473_7869.txt',
  'cv474_10682.txt',
  'cv475_22978.txt',
  'cv476_18402.txt',
  'cv477_23530.txt',
  'cv478_15921.txt',
  'cv479_5450.txt',
  'cv480_21195.txt',
  'cv481_7930.txt',
  'cv482_11233.txt',
  'cv483_18103.txt',
  'cv484_26169.txt',
  'cv485_26879.txt',
  'cv486_9788.txt',
  'cv487_11058.txt',
  'cv488_21453.txt',
  'cv489_19046.txt',
  'cv490_18986.txt',
  'cv491_12992.txt',
  'cv492_19370.txt',
  'cv493_14135.txt',
  'cv494_18689.txt',
  'cv495_16121.txt',
  'cv496_11185.txt',
  'cv497_27086.txt',
  'cv498_9288.txt',
  'cv499_11407.txt',
  'cv500_10722.txt',
  'cv501_12675.txt',
  'cv502_10970.txt',
  'cv503_11196.txt',
  'cv504_29120.txt',
  'cv505_12926.txt',
  'cv506_17521.txt',
  'cv507_9509.txt',
  'cv508_17742.txt',
  'cv509_17354.txt',
  'cv510_24758.txt',
  'cv511_10360.txt',
  'cv512_17618.txt',
  'cv513_7236.txt',
  'cv514_12173.txt',
  'cv515_18484.txt',
  'cv516_12117.txt',
  'cv517_20616.txt',
  'cv518_14798.txt',
  'cv519_16239.txt',
  'cv520_13297.txt',
  'cv521_1730.txt',
  'cv522_5418.txt',
  'cv523_18285.txt',
  'cv524_24885.txt',
  'cv525_17930.txt',
  'cv526_12868.txt',
  'cv527_10338.txt',
  'cv528_11669.txt',
  'cv529_10972.txt',
  'cv530_17949.txt',
  'cv531_26838.txt',
  'cv532_6495.txt',
  'cv533_9843.txt',
  'cv534_15683.txt',
  'cv535_21183.txt',
  'cv536_27221.txt',
  'cv537_13516.txt',
  'cv538_28485.txt',
  'cv539_21865.txt',
  'cv540_3092.txt',
  'cv541_28683.txt',
  'cv542_20359.txt',
  'cv543_5107.txt',
  'cv544_5301.txt',
  'cv545_12848.txt',
  'cv546_12723.txt',
  'cv547_18043.txt',
  'cv548_18944.txt',
  'cv549_22771.txt',
  'cv550_23226.txt',
  'cv551_11214.txt',
  'cv552_0150.txt',
  'cv553_26965.txt',
  'cv554_14678.txt',
  'cv555_25047.txt',
  'cv556_16563.txt',
  'cv557_12237.txt',
  'cv558_29376.txt',
  'cv559_0057.txt',
  'cv560_18608.txt',
  'cv561_9484.txt',
  'cv562_10847.txt',
  'cv563_18610.txt',
  'cv564_12011.txt',
  'cv565_29403.txt',
  'cv566_8967.txt',
  'cv567_29420.txt',
  'cv568_17065.txt',
  'cv569_26750.txt',
  'cv570_28960.txt',
  'cv571_29292.txt',
  'cv572_20053.txt',
  'cv573_29384.txt',
  'cv574_23191.txt',
  'cv575_22598.txt',
  'cv576_15688.txt',
  'cv577_28220.txt',
  'cv578_16825.txt',
  'cv579_12542.txt',
  'cv580_15681.txt',
  'cv581_20790.txt',
  'cv582_6678.txt',
  'cv583_29465.txt',
  'cv584_29549.txt',
  'cv585_23576.txt',
  'cv586_8048.txt',
  'cv587_20532.txt',
  'cv588_14467.txt',
  'cv589_12853.txt',
  'cv590_20712.txt',
  'cv591_24887.txt',
  'cv592_23391.txt',
  'cv593_11931.txt',
  'cv594_11945.txt',
  'cv595_26420.txt',
  'cv596_4367.txt',
  'cv597_26744.txt',
  'cv598_18184.txt',
  'cv599_22197.txt',
  'cv600_25043.txt',
  'cv601_24759.txt',
  'cv602_8830.txt',
  'cv603_18885.txt',
  'cv604_23339.txt',
  'cv605_12730.txt',
  'cv606_17672.txt',
  'cv607_8235.txt',
  'cv608_24647.txt',
  'cv609_25038.txt',
  'cv610_24153.txt',
  'cv611_2253.txt',
  'cv612_5396.txt',
  'cv613_23104.txt',
  'cv614_11320.txt',
  'cv615_15734.txt',
  'cv616_29187.txt',
  'cv617_9561.txt',
  'cv618_9469.txt',
  'cv619_13677.txt',
  'cv620_2556.txt',
  'cv621_15984.txt',
  'cv622_8583.txt',
  'cv623_16988.txt',
  'cv624_11601.txt',
  'cv625_13518.txt',
  'cv626_7907.txt',
  'cv627_12603.txt',
  'cv628_20758.txt',
  'cv629_16604.txt',
  'cv630_10152.txt',
  'cv631_4782.txt',
  'cv632_9704.txt',
  'cv633_29730.txt',
  'cv634_11989.txt',
  'cv635_0984.txt',
  'cv636_16954.txt',
  'cv637_13682.txt',
  'cv638_29394.txt',
  'cv639_10797.txt',
  'cv640_5380.txt',
  'cv641_13412.txt',
  'cv642_29788.txt',
  'cv643_29282.txt',
  'cv644_18551.txt',
  'cv645_17078.txt',
  'cv646_16817.txt',
  'cv647_15275.txt',
  'cv648_17277.txt',
  'cv649_13947.txt',
  'cv650_15974.txt',
  'cv651_11120.txt',
  'cv652_15653.txt',
  'cv653_2107.txt',
  'cv654_19345.txt',
  'cv655_12055.txt',
  'cv656_25395.txt',
  'cv657_25835.txt',
  'cv658_11186.txt',
  'cv659_21483.txt',
  'cv660_23140.txt',
  'cv661_25780.txt',
  'cv662_14791.txt',
  'cv663_14484.txt',
  'cv664_4264.txt',
  'cv665_29386.txt',
  'cv666_20301.txt',
  'cv667_19672.txt',
  'cv668_18848.txt',
  'cv669_24318.txt',
  'cv670_2666.txt',
  'cv671_5164.txt',
  'cv672_27988.txt',
  'cv673_25874.txt',
  'cv674_11593.txt',
  'cv675_22871.txt',
  'cv676_22202.txt',
  'cv677_18938.txt',
  'cv678_14887.txt',
  'cv679_28221.txt',
  'cv680_10533.txt',
  'cv681_9744.txt',
  'cv682_17947.txt',
  'cv683_13047.txt',
  'cv684_12727.txt',
  'cv685_5710.txt',
  'cv686_15553.txt',
  'cv687_22207.txt',
  'cv688_7884.txt',
  'cv689_13701.txt',
  'cv690_5425.txt',
  'cv691_5090.txt',
  'cv692_17026.txt',
  'cv693_19147.txt',
  'cv694_4526.txt',
  'cv695_22268.txt',
  'cv696_29619.txt',
  'cv697_12106.txt',
  'cv698_16930.txt',
  'cv699_7773.txt',
  'cv700_23163.txt',
  'cv701_15880.txt',
  'cv702_12371.txt',
  'cv703_17948.txt',
  'cv704_17622.txt',
  'cv705_11973.txt',
  'cv706_25883.txt',
  'cv707_11421.txt',
  'cv708_28539.txt',
  'cv709_11173.txt',
  'cv710_23745.txt',
  'cv711_12687.txt',
  'cv712_24217.txt',
  'cv713_29002.txt',
  'cv714_19704.txt',
  'cv715_19246.txt',
  'cv716_11153.txt',
  'cv717_17472.txt',
  'cv718_12227.txt',
  'cv719_5581.txt',
  'cv720_5383.txt',
  'cv721_28993.txt',
  'cv722_7571.txt',
  'cv723_9002.txt',
  'cv724_15265.txt',
  'cv725_10266.txt',
  'cv726_4365.txt',
  'cv727_5006.txt',
  'cv728_17931.txt',
  'cv729_10475.txt',
  'cv730_10729.txt',
  'cv731_3968.txt',
  'cv732_13092.txt',
  'cv733_9891.txt',
  'cv734_22821.txt',
  'cv735_20218.txt',
  'cv736_24947.txt',
  'cv737_28733.txt',
  'cv738_10287.txt',
  'cv739_12179.txt',
  'cv740_13643.txt',
  'cv741_12765.txt',
  'cv742_8279.txt',
  'cv743_17023.txt',
  'cv744_10091.txt',
  'cv745_14009.txt',
  'cv746_10471.txt',
  'cv747_18189.txt',
  'cv748_14044.txt',
  'cv749_18960.txt',
  'cv750_10606.txt',
  'cv751_17208.txt',
  'cv752_25330.txt',
  'cv753_11812.txt',
  'cv754_7709.txt',
  'cv755_24881.txt',
  'cv756_23676.txt',
  'cv757_10668.txt',
  'cv758_9740.txt',
  'cv759_15091.txt',
  'cv760_8977.txt',
  'cv761_13769.txt',
  'cv762_15604.txt',
  'cv763_16486.txt',
  'cv764_12701.txt',
  'cv765_20429.txt',
  'cv766_7983.txt',
  'cv767_15673.txt',
  'cv768_12709.txt',
  'cv769_8565.txt',
  'cv770_11061.txt',
  'cv771_28466.txt',
  'cv772_12971.txt',
  'cv773_20264.txt',
  'cv774_15488.txt',
  'cv775_17966.txt',
  'cv776_21934.txt',
  'cv777_10247.txt',
  'cv778_18629.txt',
  'cv779_18989.txt',
  'cv780_8467.txt',
  'cv781_5358.txt',
  'cv782_21078.txt',
  'cv783_14724.txt',
  'cv784_16077.txt',
  'cv785_23748.txt',
  'cv786_23608.txt',
  'cv787_15277.txt',
  'cv788_26409.txt',
  'cv789_12991.txt',
  'cv790_16202.txt',
  'cv791_17995.txt',
  'cv792_3257.txt',
  'cv793_15235.txt',
  'cv794_17353.txt',
  'cv795_10291.txt',
  'cv796_17243.txt',
  'cv797_7245.txt',
  'cv798_24779.txt',
  'cv799_19812.txt',
  'cv800_13494.txt',
  'cv801_26335.txt',
  'cv802_28381.txt',
  'cv803_8584.txt',
  'cv804_11763.txt',
  'cv805_21128.txt',
  'cv806_9405.txt',
  'cv807_23024.txt',
  'cv808_13773.txt',
  'cv809_5012.txt',
  'cv810_13660.txt',
  'cv811_22646.txt',
  'cv812_19051.txt',
  'cv813_6649.txt',
  'cv814_20316.txt',
  'cv815_23466.txt',
  'cv816_15257.txt',
  'cv817_3675.txt',
  'cv818_10698.txt',
  'cv819_9567.txt',
  'cv820_24157.txt',
  'cv821_29283.txt',
  'cv822_21545.txt',
  'cv823_17055.txt',
  'cv824_9335.txt',
  'cv825_5168.txt',
  'cv826_12761.txt',
  'cv827_19479.txt',
  'cv828_21392.txt',
  'cv829_21725.txt',
  'cv830_5778.txt',
  'cv831_16325.txt',
  'cv832_24713.txt',
  'cv833_11961.txt',
  'cv834_23192.txt',
  'cv835_20531.txt',
  'cv836_14311.txt',
  'cv837_27232.txt',
  'cv838_25886.txt',
  'cv839_22807.txt',
  'cv840_18033.txt',
  'cv841_3367.txt',
  'cv842_5702.txt',
  'cv843_17054.txt',
  'cv844_13890.txt',
  'cv845_15886.txt',
  'cv846_29359.txt',
  'cv847_20855.txt',
  'cv848_10061.txt',
  'cv849_17215.txt',
  'cv850_18185.txt',
  'cv851_21895.txt',
  'cv852_27512.txt',
  'cv853_29119.txt',
  'cv854_18955.txt',
  'cv855_22134.txt',
  'cv856_28882.txt',
  'cv857_17527.txt',
  'cv858_20266.txt',
  'cv859_15689.txt',
  'cv860_15520.txt',
  'cv861_12809.txt',
  'cv862_15924.txt',
  'cv863_7912.txt',
  'cv864_3087.txt',
  'cv865_28796.txt',
  'cv866_29447.txt',
  'cv867_18362.txt',
  'cv868_12799.txt',
  'cv869_24782.txt',
  'cv870_18090.txt',
  'cv871_25971.txt',
  'cv872_13710.txt',
  'cv873_19937.txt',
  'cv874_12182.txt',
  'cv875_5622.txt',
  'cv876_9633.txt',
  'cv877_29132.txt',
  'cv878_17204.txt',
  'cv879_16585.txt',
  'cv880_29629.txt',
  'cv881_14767.txt',
  'cv882_10042.txt',
  'cv883_27621.txt',
  'cv884_15230.txt',
  'cv885_13390.txt',
  'cv886_19210.txt',
  'cv887_5306.txt',
  'cv888_25678.txt',
  'cv889_22670.txt',
  'cv890_3515.txt',
  'cv891_6035.txt',
  'cv892_18788.txt',
  'cv893_26731.txt',
  'cv894_22140.txt',
  'cv895_22200.txt',
  'cv896_17819.txt',
  'cv897_11703.txt',
  'cv898_1576.txt',
  'cv899_17812.txt',
  'cv900_10800.txt',
  'cv901_11934.txt',
  'cv902_13217.txt',
  'cv903_18981.txt',
  'cv904_25663.txt',
  'cv905_28965.txt',
  'cv906_12332.txt',
  'cv907_3193.txt',
  'cv908_17779.txt',
  'cv909_9973.txt',
  'cv910_21930.txt',
  'cv911_21695.txt',
  'cv912_5562.txt',
  'cv913_29127.txt',
  'cv914_2856.txt',
  'cv915_9342.txt',
  'cv916_17034.txt',
  'cv917_29484.txt',
  'cv918_27080.txt',
  'cv919_18155.txt',
  'cv920_29423.txt',
  'cv921_13988.txt',
  'cv922_10185.txt',
  'cv923_11951.txt',
  'cv924_29397.txt',
  'cv925_9459.txt',
  'cv926_18471.txt',
  'cv927_11471.txt',
  'cv928_9478.txt',
  'cv929_1841.txt',
  'cv930_14949.txt',
  'cv931_18783.txt',
  'cv932_14854.txt',
  'cv933_24953.txt',
  'cv934_20426.txt',
  'cv935_24977.txt',
  'cv936_17473.txt',
  'cv937_9816.txt',
  'cv938_10706.txt',
  'cv939_11247.txt',
  'cv940_18935.txt',
  'cv941_10718.txt',
  'cv942_18509.txt',
  'cv943_23547.txt',
  'cv944_15042.txt',
  'cv945_13012.txt',
  'cv946_20084.txt',
  'cv947_11316.txt',
  'cv948_25870.txt',
  'cv949_21565.txt',
  'cv950_13478.txt',
  'cv951_11816.txt',
  'cv952_26375.txt',
  'cv953_7078.txt',
  'cv954_19932.txt',
  'cv955_26154.txt',
  'cv956_12547.txt',
  'cv957_9059.txt',
  'cv958_13020.txt',
  'cv959_16218.txt',
  'cv960_28877.txt',
  'cv961_5578.txt',
  'cv962_9813.txt',
  'cv963_7208.txt',
  'cv964_5794.txt',
  'cv965_26688.txt',
  'cv966_28671.txt',
  'cv967_5626.txt',
  'cv968_25413.txt',
  'cv969_14760.txt',
  'cv970_19532.txt',
  'cv971_11790.txt',
  'cv972_26837.txt',
  'cv973_10171.txt',
  'cv974_24303.txt',
  'cv975_11920.txt',
  'cv976_10724.txt',
  'cv977_4776.txt',
  'cv978_22192.txt',
  'cv979_2029.txt',
  'cv980_11851.txt',
  'cv981_16679.txt',
  'cv982_22209.txt',
  'cv983_24219.txt',
  'cv984_14006.txt',
  'cv985_5964.txt',
  'cv986_15092.txt',
  'cv987_7394.txt',
  'cv988_20168.txt',
  'cv989_17297.txt',
  'cv990_12443.txt',
  'cv991_19973.txt',
  'cv992_12806.txt',
  'cv993_29565.txt',
  'cv994_13229.txt',
  'cv995_23113.txt',
  'cv996_12447.txt',
  'cv997_5152.txt',
  'cv998_15691.txt',
  'cv999_14636.txt'])

The subfolder ../moviereviews/neg contains 1000 text files.


In [16]:
next(gen) # this walks the /pos/ subfolder
next(gen)


---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
<ipython-input-16-e2a758a6db89> in <module>()
      1 next(gen) # this walks the /pos/ subfolder
----> 2 next(gen)

StopIteration: 

The generator raised StopIteration because os.walk() had already walked every subfolder and had nothing left to yield.
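In practice you rarely call next() by hand. A for loop consumes the generator and handles StopIteration for you, and next() accepts a default value to return instead of raising. A minimal sketch, using an empty temporary folder as a stand-in for the review directory:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    gen = os.walk(root)
    first = next(gen)           # the root folder itself
    second = next(gen, None)    # nothing left to walk: returns the default instead of raising

    # A for loop (here a comprehension) stops cleanly when the generator is exhausted
    visited = [folder for folder, subfolders, filenames in os.walk(root)]

print(first[1], first[2])  # subfolder and file lists for the empty root
print(second)
print(len(visited))
```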

Use os.walk() to build a DataFrame

An efficient way to build a DataFrame from individual text files is to first build a list of dictionaries, then convert the whole list to a DataFrame in one step, rather than appending rows to the DataFrame one at a time.
We'll take the following steps to build our list:

  1. Start with a list of subdirectory names ('neg' and 'pos')
  2. Walk each subdirectory
  3. Create a dictionary object for every file in a subdirectory, where label is either 'neg' or 'pos' and review is the text of the file.
  4. Handle files that have no text (perhaps a reviewer ranked a movie without commenting on it) so that those records are given NaN values.
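The NaN behavior in step 4 can be checked on a tiny hand-made list before touching the files. This is a sketch with made-up rows: a dictionary that lacks the review key stands in for an empty file, and pandas fills the missing value with NaN when the list is converted.

```python
import pandas as pd

# Hypothetical rows: the second dictionary has no 'review' key, as an empty file would
rows = [
    {'label': 'neg', 'review': 'a dull film'},
    {'label': 'neg'},
    {'label': 'pos', 'review': 'a delight'},
]

demo = pd.DataFrame(rows)            # one construction call for the whole list
print(demo)
print(demo['review'].isna().sum())   # number of rows with a missing review
```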

In [20]:
row_list = []

for subdir in ['neg','pos']:
    for folder, subfolders, filenames in os.walk('../moviereviews/'+subdir):
        for file in filenames:
            d = {'label':subdir}  # assign the name of the subdirectory to the label field
            with open(os.path.join(folder, file)) as f:  # build the path from the folder being walked
                text = f.read()
                if text:          # empty files get no review key, which becomes NaN in the DataFrame
                    d['review'] = text  # assign the contents of the file to the review field
            row_list.append(d)
        break  # only read the top level of each subdirectory

In [21]:
df = pd.DataFrame(row_list)

In [22]:
df.head()


Out[22]:
   label  review
0  neg    NaN
1  neg    the happy bastard's quick movie review \ndamn ...
2  neg    it is movies like these that make a jaded movi...
3  neg    " quest for camelot " is warner bros . '  firs...
4  neg    synopsis : a mentally unstable man undergoing ...
