Ta:Programming Fundamentals and Python
2 Programming Fundamentals and Python
This chapter gives a brief introduction to Python and provides the programming knowledge needed for the chapters that follow in Part I. It contains many examples and exercises. There is no better way to learn to program than to dive in and try things yourself. You should then adapt what you have learned to your own needs. Before you know it, you will be programming.
2.1 Getting Started
One of the friendly things about Python is that you can type directly into the interactive interpreter, the program that runs your Python programs. You can run the Python interpreter through IDLE, a simple graphical development environment. On a Mac you can find it under Applications -> MacPython, and on Windows under All Programs -> Python. On Unix-like systems, typing python in a terminal is enough. The interpreter prints some information about your Python version. Check that you are running Python 2.4 or later (here it is 2.5):
Python 2.5.1 (r251:54863, Oct 5 2007, 13:36:32)
[GCC 4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>
Note
If you are unable to start the Python interpreter, Python is probably not installed correctly. Visit http://nltk.org/ for help.
The >>> prompt indicates that the interpreter is waiting for input. Let's begin by using Python as a calculator:
>>> 3 + 2 * 5 - 1
12
>>>
There are several things to notice here. First, once the interpreter has calculated and displayed the answer, the prompt reappears. This means the interpreter is waiting for another instruction. Second, Python applies the usual order of operations, so 2 * 5 is calculated before it is added to 3.
Try a few more calculations. You can use the asterisk (*) for multiplication and the slash (/) for division. One oddity you may run into is that division does not always behave as you expect:
>>> 3/3
1
>>> 1/3
0
>>>
The second example may look strange because the answer is not 0.333333. We will come back to why this happens later in this chapter. For now, treat these exercises as practice in using the interpreter interactively. Later you will also see how your intuitions about numerical expressions carry over to other data types.
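The result of 1/3 above comes from Python 2 treating / between two integers as integer division. As a minimal sketch, assuming a Python 3 interpreter (where print is a function and / always performs true division), the two behaviors can be compared side by side. The variable names are chosen only for illustration.

```python
# Python 3 sketch: '/' is true division, '//' is floor division.
# In Python 2, 1/3 with two integers behaves like 1//3 does here.
true_quotient = 1 / 3      # a float close to 0.333
floor_quotient = 1 // 3    # the integer 0, as in the Python 2 session
exact = 3 // 3             # 1 in both versions

print(true_quotient, floor_quotient, exact)
```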
Also try some nonsensical expressions and watch how the interpreter handles them:
>>> 1 +
Traceback (most recent call last):
File "<stdin>", line 1
1 +
^
SyntaxError: invalid syntax
>>>
Here the interpreter reports a syntax error. It makes no sense to end an instruction with a plus sign. Notice that the interpreter also points to where the problem is.
2.2 Understanding the Basics: Strings and Variables
2.2.1 Handling Text
We cannot simply type text directly into the interpreter, because it would try to interpret the text as part of the Python language (the phrase used here means "Hello, Tamil Nadu"):
>>> வணக்கம் தமிழகம்
Traceback (most recent call last):
File "<stdin>", line 1
வணக்கம் தமிழகம்
^
SyntaxError: invalid syntax
>>>
Notice that an error is reported.
Python represents a piece of text using a string. Strings are set apart from the rest of the program by quotation marks:
>>> 'வணக்கம் தமிழகம்'
'வணக்கம் தமிழகம்'
>>> "வணக்கம் தமிழகம்"
'வணக்கம் தமிழகம்'
>>>
We can use either single or double quotation marks. There is only one condition: the same kind of mark must be used at both ends of the string.
Just as we used Python as a calculator for numbers, we can perform some operations on strings. For example, adding two strings together is intuitive enough that you can guess the result:
>>> 'வணக்கம்' + 'தமிழகம்'
'வணக்கம்தமிழகம்'
>>>
When applied to strings, the + operation is called concatenation. It produces a new string made of the two given strings joined end to end. Notice that concatenation does not do anything clever like inserting a space between the words. The interpreter has no way of knowing that you want a space. Given this example, you can probably guess what the multiplication operator does:
>>> 'வணக்கம்' + 'வணக்கம்' + 'வணக்கம்'
'வணக்கம்வணக்கம்வணக்கம்'
>>> 'வணக்கம்' * 3
'வணக்கம்வணக்கம்வணக்கம்'
>>>
The point to take from this is that in Python, intuition about what should work gets you a long way. The most productive way to find out what happens is to try things. You are very unlikely to break anything, so just give it a go.
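As a small sketch of the point about spaces (written for Python 3, where print is a function), a space has to be supplied explicitly when concatenating, and repetition with * gives the same result as repeated +. The variable names greeting and place are illustrative and do not appear in the original.

```python
greeting = 'வணக்கம்'
place = 'தமிழகம்'

joined = greeting + place          # no space is inserted
spaced = greeting + ' ' + place    # the space is supplied explicitly
repeated = greeting * 3            # same as greeting + greeting + greeting

print(joined)
print(spaced)
print(repeated)
```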
2.2.2 Storing and Reusing Values
After a while, it gets tiresome to keep retyping Python statements. It would be useful to store the value of an expression like 'வணக்கம்' + 'வணக்கம்' + 'வணக்கம்' so that we can use it again. We do this by saving the result to a location in the computer's memory and giving that location a name. Such a named location is called a variable. In Python we create variables by assignment, that is, by giving a value to the variable:
>>> saram = 'வணக்கம் தமிழகம்'   # [1]
>>> saram                          # [2]
'வணக்கம் தமிழகம்'                 # [3]
>>>
In line [1] we assign the value 'வணக்கம் தமிழகம்' to the variable saram. The = operator assigns the value of the expression on its right to the variable on its left. Notice that the interpreter does not print anything. It prints output only when a statement returns a value, and an assignment statement returns no value. In line [2] we inspect the contents of the variable by typing its name, saram, at the prompt. The interpreter prints the contents of the variable in line [3].
Variables stand in for values, so instead of writing 'வணக்கம்' * 3 we can assign 'வணக்கம்' to the variable saram and 3 to the variable perukkam, then perform the multiplication using the variable names:
>>> saram = 'வணக்கம்'
>>> perukkam = 3
>>> saram * perukkam
'வணக்கம்வணக்கம்வணக்கம்'
>>>
The names we choose for variables are entirely up to us. Instead of saram and perukkam we could have used sangeetham and santhosam, and the result would be the same:
>>> sangeetham = 'வணக்கம்'
>>> santhosam = 3
>>> sangeetham * santhosam
'வணக்கம்வணக்கம்வணக்கம்'
>>>
Giving variables meaningful names makes a program easier to understand for anyone who reads it. Python does not care about the names and blindly follows your instructions. It does not object even to a confusing assignment such as irandu = 3 (irandu is Tamil for "two").
We can also assign a new value to a variable:
>>> saram = saram * perukkam
>>> saram
'வணக்கம்வணக்கம்வணக்கம்'
>>>
Here we take the value of saram, multiply it by perukkam, and store the resulting string ('வணக்கம்வணக்கம்வணக்கம்') back into the variable saram.
2.2.3 Printing and Inspecting Strings
So far, when we wanted to see the result of a calculation or the contents of a variable, we typed the variable name into the interpreter. We can also see the contents of saram using print saram:
>>> saram = 'வணக்கம் தமிழகம்'
>>> saram
'வணக்கம் தமிழகம்'
>>> print saram
வணக்கம் தமிழகம்
>>>
On close inspection, you will see that the quotation marks are missing in the second case. Typing a bare variable name into the interactive interpreter displays Python's representation of its value. In contrast, the print statement displays only the value itself.
You can also give a print statement several expressions separated by commas:
>>> saram2 = 'நன்றி'
>>> print saram, saram2
வணக்கம் தமிழகம் நன்றி
>>>
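The difference between the two displays can be reproduced with the built-in repr(), which returns the representation the interpreter echoes for a bare name. This is a minimal sketch assuming Python 3, where the print statement used above becomes a call to print(). The names echoed and shown are illustrative.

```python
saram = 'வணக்கம் தமிழகம்'
saram2 = 'நன்றி'

echoed = repr(saram)   # what the interpreter echoes for a bare name: quotes included
shown = str(saram)     # what print displays: the text itself

print(echoed)
print(shown)
print(saram, saram2)   # comma-separated values are printed with a space between them
```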
Note
If you have created a variable called maari, you can type help(maari) to read the help for this kind of object. Type dir(maari) to see the operations that are defined on the object.
2.2.4 Creating Programs with a Text Editor
The Python interactive interpreter performs your instructions as soon as you type them. Often it is more useful to compose a multi-line program in a text editor and then ask Python to run it. You can do this with IDLE, or with a text editor such as Gedit or Kate. Because IDLE does not currently support Tamil, open the text editor of your choice and enter the following short program (the string means "mother"):
#!/usr/bin/python
# -*- coding: utf-8 -*-
saram='அம்மா'
Save this file as sothanai.py. Open a terminal and change to the directory that contains the file. Then run the command python sothanai.py. You will see that it returns to the prompt without printing anything.
The question that arises is why the value assigned to the variable was not displayed. The program sothanai.py prints a value only when it is explicitly told to with the print statement. So add one more line, as follows:
#!/usr/bin/python
# -*- coding: utf-8 -*-
saram='அம்மா'
print saram
Run python sothanai.py from the terminal as before, and the value of saram is displayed. The first run below is the earlier version of the file, and the second is the version with print:
amachu@amachu-laptop:~$ python sothanai.py
amachu@amachu-laptop:~$ python sothanai.py
அம்மா
From now on, you have two options: the interactive interpreter or a text editor. It is often convenient to test your ideas in the interpreter and, once they give the result you expect, copy them into the text editor, extend them, save the file and run it. This also spares you from retyping things again and again.
2.2.5 Exercises
1. ☼ Start up the Python interpreter (e.g. by running IDLE). Try the examples in section 2.1, then experiment with using Python as a calculator.
2. ☼ Try the examples in this section, then try the following.
   1. Create a variable called msg and put a message of your own in this variable. Remember that strings need to be quoted, so you will need to type something like:
      >>> msg = "I like NLP!"
   2. Now print the contents of this variable in two ways, first by simply typing the variable name and pressing enter, then by using the print command.
   3. Try various arithmetic expressions using this string, e.g. msg + msg, and 5 * msg.
   4. Define a new string hello, and then try hello + msg. Change the hello string so that it ends with a space character, and then try hello + msg again.
2.3 Slicing and Dicing
Strings are so important that we will spend some more time on them. Here we will learn how to access the individual characters that make up a string, how to pull out substrings, and how to reverse strings.
2.3.1 Accessing Individual Characters
The positions within a string are numbered, starting from zero. To access a position within a string, we give its number inside square brackets. However, this applies directly only to English (ASCII) letters. Tamil letters are generally stored in three bytes each in UTF-8, so in a byte string the first letter occupies positions 0 through 2. Consider the following English example:
>>> msg = 'Hello World'
>>> msg[0]
'H'
>>> msg[3]
'l'
>>> msg[5]
' '
>>>
This is called indexing the string. The number we give inside the square brackets is called the index. We are not limited to retrieving letters: we can also retrieve the space at index 5. This works for English. What can be done for Tamil?
Note
Be careful to distinguish between the empty string '' and ' ', a string containing a single space.
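One possible answer to the question about Tamil, sketched here under the assumption of Python 3 rather than the Python 2 used in this chapter: its str type is indexed by Unicode code point instead of by byte, so the three-byte encoding only appears after an explicit encode(). A code point is still not always a whole written letter, because a mark such as the pulli is a separate code point. The variable names are illustrative.

```python
word = 'வணக்கம்'

first = word[0]                   # indexes a code point, not a byte
encoded = word.encode('utf-8')    # the bytes a Python 2 byte string would hold

# Every code point in the Tamil block takes three bytes in UTF-8,
# so the first letter occupies byte positions 0 through 2.
first_bytes = encoded[0:3]

print(first, len(word), len(encoded))
```

In Python 2 itself, a similar effect is usually obtained by working with unicode strings rather than byte strings.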
The fact that strings are indexed from zero may seem counter-intuitive at first. It can help to think of an index as the position immediately before a character in the string, as shown in Figure 2.1.
Figure 2.1: String Indexing
Now, what happens when we give an index that lies beyond the end of the string?
>>> msg[11]
Traceback (most recent call last):
File "<stdin>", line 1, in ?
IndexError: string index out of range
>>>
The index 11 is outside the range of valid indices (0 to 10) for the string 'Hello World'. This is not a syntax error. Instead, the error occurred while the program was running. The Traceback message indicates the line on which the error occurred (line 1 of "standard input"). It is followed by the name of the error, IndexError, and a brief explanation.
How do we find the last index of a word? If the length of the word or string is n, its last index is n-1. The len() function computes the length of a string.
>>> len(msg)
11
>>>
Informally, a function is a named snippet of code that performs a task for our program when we call it. We call len() by following its name with parentheses that contain the string whose length we want. Because len() is built into the Python interpreter, it is displayed in brown.
We have seen what happens when the index lies beyond the end of the string. What happens when it lies before the start? Let's see what happens when we use values less than zero:
>>> msg[-1]
'd'
>>>
This does not generate an error. Instead, negative indices count from the end of the string, so -1 gives the last character, 'd'.
>>> msg[-3]
'r'
>>> msg[-6]
' '
>>>
In this case the computer goes to the location in memory where the string is stored, adds the length of the string, and then subtracts the index value, e.g. 3136 + 11 - 1 = 3146. This is illustrated in Figure 2.2.
Figure 2.2: Negative Indices
Thus we have two ways to access the characters in a string: from the start or from the end. For example, we can retrieve the space in Hello World with either msg[5] or msg[-6]. These refer to the same location, because 5 = len(msg) - 6.
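The relationship between the two kinds of index can be checked for every position in the string. The following is a minimal sketch, assuming Python 3; the names pairs and all_match are illustrative.

```python
msg = 'Hello World'

# For every valid negative index i, msg[i] is the same character as msg[len(msg) + i].
pairs = []
for neg in range(-len(msg), 0):
    pairs.append((neg, len(msg) + neg))

all_match = True
for neg, pos in pairs:
    if msg[neg] != msg[pos]:
        all_match = False

print(msg[5] == msg[-6], all_match)
```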
2.3.2 Accessing Substrings
In NLP we usually need to access more than one character at a time. This is also quite simple. For example, the following code retrieves the substring that starts at index 1 and runs up to, but does not include, index 4:
>>> msg[1:4]
'ell'
>>>
The notation 1:4 is called a slice. Here we see the characters 'e', 'l' and 'l', which correspond to msg[1], msg[2] and msg[3], but not msg[4]. This is because a slice starts at the first index and stops one before the end index. This is consistent with indexing: indexing also starts from zero and goes up to one before the length of the string. We can see this by slicing with the value of len():
>>> len(msg)
11
>>> msg[0:11]
'Hello World'
>>>
We can also slice with negative indices — the same basic rule of starting from the start index and stopping one before the end index applies; here we stop before the space character:
>>> msg[0:-6]
'Hello'
>>>
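The slice rule can be verified directly. This is a small sketch assuming Python 3, using only explicit start and end indexes; the names piece, same and length_rule are illustrative.

```python
msg = 'Hello World'

piece = msg[1:4]                       # characters at indexes 1, 2 and 3
same = msg[1] + msg[2] + msg[3]        # the same substring built one index at a time
length_rule = len(msg[1:4]) == 4 - 1   # for in-range indexes, [a:b] holds b - a characters

# A negative end index counts from the end, so -6 names the same position as 5.
print(piece, same, length_rule, msg[0:-6] == msg[0:5])
```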
Python provides two shortcuts for commonly used slice values. If the start index is 0 then you can leave it out, and if the end index is the length of the string then you can leave it out:
>>> msg[:3]
'Hel'
>>> msg[6:]
'World'
>>>
The first example above selects the first three characters from the string, and the second example selects from the character with index 6, namely 'W', to the end of the string.
2.3.3 Exercises
1. ☼ Define a string s = 'colorless'. Write a Python statement that changes this to "colourless" using only the slice and concatenation operations.
2. ☼ Try the slice examples from this section using the interactive interpreter. Then try some more of your own. Guess what the result will be before executing the command.
3. ☼ We can use the slice notation to remove morphological endings on words. For example, 'dogs'[:-1] removes the last character of dogs, leaving dog. Use slice notation to remove the affixes from these words (we've inserted a hyphen to indicate the affix boundary, but omit this from your strings): dish-es, run-ning, nation-ality, un-do, pre-heat.
4. ☼ We saw how we can generate an IndexError by indexing beyond the end of a string. Is it possible to construct an index that goes too far to the left, before the start of the string?
5. ☼ We can also specify a "step" size for the slice. The following returns every second character within the slice, in a forward or reverse direction:
>>> msg[6:11:2]
'Wrd'
>>> msg[10:5:-2]
'drW'
>>>
Experiment with different step values.
6. ☼ What happens if you ask the interpreter to evaluate msg[::-1]? Explain why this is a reasonable result.
2.4 Strings, Sequences, and Sentences
We have seen how words like Hello can be stored as a string 'Hello'. Whole sentences can also be stored in strings, and manipulated as before, as we can see here for Chomsky's famous nonsense sentence:
>>> sent = 'colorless green ideas sleep furiously'
>>> sent[16:21]
'ideas'
>>> len(sent)
37
>>>
However, it turns out to be a bad idea to treat a sentence as a sequence of its characters, because this makes it too inconvenient to access the words. Instead, we would prefer to represent a sentence as a sequence of its words; as a result, indexing a sentence accesses the words, rather than characters. We will see how to do this now.
2.4.1 Lists
A list is designed to store a sequence of values. A list is similar to a string in many ways except that individual items don't have to be just characters; they can be arbitrary strings, integers or even other lists.
A Python list is represented as a sequence of comma-separated items, delimited by square brackets. Here are some lists:
>>> squares = [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
>>> shopping_list = ['juice', 'muffins', 'bleach', 'shampoo']
We can also store sentences and phrases using lists. Let's create part of Chomsky's sentence as a list and put it in a variable cgi:
>>> cgi = ['colorless', 'green', 'ideas']
>>> cgi
['colorless', 'green', 'ideas']
>>>
Because lists and strings are both kinds of sequence, they can be processed in similar ways; just as strings support len(), indexing and slicing, so do lists. The following example applies these familiar operations to the list cgi:
>>> len(cgi)
3
>>> cgi[0]
'colorless'
>>> cgi[-1]
'ideas'
>>> cgi[-5]
Traceback (most recent call last):
File "<stdin>", line 1, in ?
IndexError: list index out of range
>>>
Here, cgi[-5] generates an error, because the fifth-last item in a three item list would occur before the list started, i.e., it is undefined. We can also slice lists in exactly the same way as strings:
>>> cgi[1:3]
['green', 'ideas']
>>> cgi[-2:]
['green', 'ideas']
>>>
Lists can be concatenated just like strings. Here we will put the resulting list into a new variable chomsky. The original variable cgi is not changed in the process:
>>> chomsky = cgi + ['sleep', 'furiously']
>>> chomsky
['colorless', 'green', 'ideas', 'sleep', 'furiously']
>>> cgi
['colorless', 'green', 'ideas']
>>>
Now, lists and strings do not have exactly the same functionality. Lists have the added power that you can change their elements. Suppose we want to change the 0th element of cgi to 'colorful'. We can do that by assigning the new value to the index cgi[0]:
>>> cgi[0] = 'colorful'
>>> cgi
['colorful', 'green', 'ideas']
>>>
On the other hand if we try to do that with a string — changing the 0th character in msg to 'J' — we get:
>>> msg[0] = 'J'
Traceback (most recent call last):
File "<stdin>", line 1, in ?
TypeError: object does not support item assignment
>>>
This is because strings are immutable — you can't change a string once you have created it. However, lists are mutable, and their contents can be modified at any time. As a result, lists support a number of operations, or methods, that modify the original value rather than returning a new value. A method is a function that is associated with a particular object. A method is called on the object by giving the object's name, then a period, then the name of the method, and finally the parentheses containing any arguments. For example, in the following code we use the sort() and reverse() methods:
>>> chomsky.sort()
>>> chomsky.reverse()
>>> chomsky
['sleep', 'ideas', 'green', 'furiously', 'colorless']
>>>
As you will see, the prompt reappears immediately on the line after chomsky.sort() and chomsky.reverse(). That is because these methods do not produce a new list, but instead modify the original list stored in the variable chomsky.
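The contrast between mutable lists and immutable strings can be sketched as follows, assuming Python 3 (the chapter's own sessions are Python 2). The names result, changed and new_msg are illustrative, and the try/except construct is used here only to let the script continue after the expected error.

```python
chomsky = ['colorless', 'green', 'ideas', 'sleep', 'furiously']

result = chomsky.sort()        # sorts in place; the call itself returns None
first_after_sort = chomsky[0]  # 'colorless' comes first alphabetically

msg = 'Hello World'
try:
    msg[0] = 'J'               # strings are immutable, so this raises TypeError
    changed = True
except TypeError:
    changed = False

new_msg = 'J' + msg[1:]        # build a new string instead of modifying msg

print(result, first_after_sort, changed, new_msg, msg)
```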
Lists also have an append() method for adding items to the end of the list and an index() method for finding the index of particular items in the list:
>>> chomsky.append('said')
>>> chomsky.append('Chomsky')
>>> chomsky
['sleep', 'ideas', 'green', 'furiously', 'colorless', 'said', 'Chomsky']
>>> chomsky.index('green')
2
>>>
Finally, just as a reminder, you can create lists of any values you like. As you can see in the following example for a lexical entry, the values in a list do not even have to have the same type (though this is usually not a good idea, as we will explain in Section 6.2).
>>> bat = ['bat', [[1, 'n', 'flying mammal'], [2, 'n', 'striking instrument']]]
>>>
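Items inside a nested list such as bat are reached by chaining indexes, one pair of square brackets per level. A minimal sketch, assuming Python 3; the names headword, senses and first_gloss are illustrative.

```python
bat = ['bat', [[1, 'n', 'flying mammal'], [2, 'n', 'striking instrument']]]

headword = bat[0]            # the string 'bat'
senses = bat[1]              # a list holding two sense entries
first_gloss = bat[1][0][2]   # first sense, then its third field

for sense in senses:
    print(headword, sense[0], sense[1], sense[2])
```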
2.4.2 Working on Sequences One Item at a Time
We have shown you how to create lists, and how to index and manipulate them in various ways. Often it is useful to step through a list and process each item in some way. We do this using a for loop. This is our first example of a control structure in Python, a statement that controls how other statements are run:
>>> for num in [1, 2, 3]:
...     print 'The number is', num
...
The number is 1
The number is 2
The number is 3
The interactive interpreter changes the prompt from >>> to ... after encountering the colon at the end of the first line. This prompt indicates that the interpreter is expecting an indented block of code to appear next. However, it is up to you to do the indentation. To finish the indented block just enter a blank line.
The for loop has the general form: for variable in sequence followed by a colon, then an indented block of code. The first time through the loop, the variable is assigned to the first item in the sequence, i.e. num has the value 1. This program runs the statement print 'The number is', num for this value of num, before returning to the top of the loop and assigning the second item to the variable. Once all items in the sequence have been processed, the loop finishes.
Now let's try the same idea with a list of words:
>>> chomsky = ['colorless', 'green', 'ideas', 'sleep', 'furiously']
>>> for word in chomsky:
...     print len(word), word[-1], word
...
9 s colorless
5 n green
5 s ideas
5 p sleep
9 y furiously
The first time through this loop, the variable is assigned the value 'colorless'. This program runs the statement print len(word), word[-1], word for this value, to produce the output line: 9 s colorless. This process is known as iteration. Each iteration of the for loop starts by assigning the next item of the list chomsky to the loop variable word. Then the indented body of the loop is run. Here the body consists of a single command, but in general the body can contain as many lines of code as you want, so long as they are all indented by the same amount. (We recommend that you always use exactly 4 spaces for indentation, and that you never use tabs.)
We can run another for loop over the Chomsky nonsense sentence, and calculate the average word length. As you will see, this program uses the len() function in two ways: to count the number of characters in a word, and to count the number of words in a phrase. Note that x += y is shorthand for x = x + y; this idiom allows us to increment the total variable each time the loop is run.
>>> total = 0
>>> for word in chomsky:
...     total += len(word)
...
>>> total / len(chomsky)
6
>>>
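The result 6 above reflects Python 2 integer division: the total is 33 and 33 / 5 is truncated. As a sketch assuming Python 3, where / keeps the fractional part, both values can be computed; the names floor_average and true_average are illustrative.

```python
chomsky = ['colorless', 'green', 'ideas', 'sleep', 'furiously']

total = 0
for word in chomsky:
    total += len(word)                  # add this word's length to the running total

floor_average = total // len(chomsky)   # matches the Python 2 result shown above
true_average = total / len(chomsky)     # Python 3 true division keeps the fraction

print(total, floor_average, true_average)
```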
We can also write for loops to iterate over the characters in strings. This print statement ends with a trailing comma, which is how we tell Python not to print a newline at the end.
>>> sent = 'colorless green ideas sleep furiously'
>>> for char in sent:
...     print char,
...
c o l o r l e s s   g r e e n   i d e a s   s l e e p   f u r i o u s l y
A note of caution: we have now iterated over words and characters, using expressions like for word in sent: and for char in sent:. Remember that, to Python, word and char are meaningless variable names, and we could have written for foo123 in sent:. The interpreter simply iterates over the items in the sequence, quite oblivious to what kind of object they represent, e.g.:
>>> for foo123 in 'colorless green ideas sleep furiously':
...     print foo123,
...
c o l o r l e s s   g r e e n   i d e a s   s l e e p   f u r i o u s l y
>>> for foo123 in ['colorless', 'green', 'ideas', 'sleep', 'furiously']:
...     print foo123,
...
colorless green ideas sleep furiously
>>>
2.4.3 String Formatting
The output of a program is usually structured to make the information easily digestible by a reader. Instead of running some code and then manually inspecting the contents of a variable, we would like the code to tabulate some output. We already saw this above in the first for loop example that used a list of words, where each line of output was similar to 5 p sleep, consisting of a word length, the last character of the word, then the word itself.
There are many ways we might want to format such output. For instance, we might want to place the length value in parentheses after the word, and print all the output on a single line:
>>> for word in chomsky:
...     print word, '(', len(word), '),',
...
colorless ( 9 ), green ( 5 ), ideas ( 5 ), sleep ( 5 ), furiously ( 9 ),
However, this approach has a couple of problems. First, the print statement intermingles variables and punctuation, making it a little difficult to read. Second, the output has spaces around every item that was printed. A cleaner way to produce structured output uses Python's string formatting expressions. Before diving into clever formatting tricks, however, let's look at a really simple example. We are going to use a special symbol, %s, as a placeholder in strings. Once we have a string containing this placeholder, we follow it with a single % and then a value v. Python then returns a new string where v has been slotted in to replace %s:
>>> "I want a %s right now" % "drink"
'I want a drink right now'
In fact, we can have a number of placeholders, but following the % operator we need to put in a tuple with exactly the same number of values:
>>> "%s wants a %s %s" % ("Lee", "sandwich", "for lunch")
'Lee wants a sandwich for lunch'
We can also provide the values for the placeholders indirectly. Here's an example using a for loop:
>>> menu = ['sandwich', 'spam fritter', 'pancake']
>>> for snack in menu:
...     "Lee wants a %s right now" % snack
...
'Lee wants a sandwich right now'
'Lee wants a spam fritter right now'
'Lee wants a pancake right now'
We oversimplified things when we said that placeholders were of the form %s; in fact, this is a complex object, called a conversion specifier. This has to start with the % character, and ends with a conversion character such as s or d. The %s specifier tells Python that the corresponding variable is a string (or should be converted into a string), while the %d specifier indicates that the corresponding variable should be converted into a decimal representation. The string containing conversion specifiers is called a format string.
Picking up on the print example that we opened this section with, here's how we can use two different kinds of conversion specifier:
>>> for word in chomsky:
...     print "%s (%d)," % (word, len(word)),
...
colorless (9), green (5), ideas (5), sleep (5), furiously (9),
To summarize, string formatting is accomplished with a three-part object having the syntax: format % values. The format section is a string containing format specifiers such as %s and %d that Python will replace with the supplied values. The values section of a formatting string is a tuple containing exactly as many items as there are format specifiers in the format section. In the case that there is just one item, the parentheses can be left out. (We will discuss Python's string-formatting expressions in more detail in Section 6.3.2).
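The format % values pattern summarized above can be sketched as follows, assuming Python 3 (the % operator on strings works there as well). The names line, single and mismatch_ok are illustrative; the try/except is included only to show what happens when the number of values does not match the number of specifiers.

```python
chomsky = ['colorless', 'green', 'ideas', 'sleep', 'furiously']

# format % values: one tuple item per conversion specifier.
line = ''
for word in chomsky:
    line += "%s (%d), " % (word, len(word))

# With a single value, the parentheses around it can be left out.
single = "I want a %s right now" % "drink"

# Too few values for the specifiers raises TypeError.
try:
    "%s wants a %s" % ("Lee",)
    mismatch_ok = True
except TypeError:
    mismatch_ok = False

print(line)
print(single, mismatch_ok)
```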
In the loop above that printed each word followed by its length, we used a trailing comma to suppress the printing of a newline. Suppose, on the other hand, that we want to introduce some additional newlines in our output. We can accomplish this by inserting the "special" character \n into the print string:
>>> for word in chomsky:
...     print "Word = %s\nIndex = %s\n*****" % (word, chomsky.index(word))
...
Word = colorless
Index = 0
*****
Word = green
Index = 1
*****
Word = ideas
Index = 2
*****
Word = sleep
Index = 3
*****
Word = furiously
Index = 4
*****
>>>
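One caveat about this example: index() reports the position of the first matching item, so a list containing a repeated word would show the same index twice. A hedged alternative, assuming Python 3 and the built-in enumerate() (which this chapter does not otherwise use), pairs each word with its own position. The list repeated is a made-up example, not data from the text.

```python
repeated = ['green', 'ideas', 'green']   # hypothetical list with a duplicate

by_index_method = []
for word in repeated:
    by_index_method.append(repeated.index(word))   # position of the first match only

by_enumerate = []
for position, word in enumerate(repeated):
    by_enumerate.append(position)                  # actual position of each item

print(by_index_method, by_enumerate)
print("Word = %s\nIndex = %s\n*****" % (repeated[2], by_enumerate[2]))
```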
2.4.4 Character Encoding and Unicode
Our programs will often need to deal with different languages, and different character sets. The concept of "plain text" is a fiction. If you live in the English-speaking world you probably use ASCII, possibly without realizing it. If you live in Europe you might use one of the extended Latin character sets, containing such characters as "ø" for Danish and Norwegian, "ő" for Hungarian, "ñ" for Spanish and Breton, and "ň" for Czech and Slovak.
2.4.5 Converting Between Strings and Lists
Often we want to convert between a string containing a space-separated list of words and a list of strings. Let's first consider turning a list into a string. One way of doing this is as follows:
>>> s = ''
>>> for word in chomsky:
...     s += ' ' + word
...
>>> s
' colorless green ideas sleep furiously'
>>>
One drawback of this approach is that we have an unwanted space at the start of s. It is more convenient to use the join() method. We specify the string to be used as the "glue", followed by a period, followed by a call to join() with the list as its argument.
>>> sent = ' '.join(chomsky)
>>> sent
'colorless green ideas sleep furiously'
>>>
Now let's try to reverse the process: that is, we want to convert a string into a list. Again, we could start off with an empty list [] and append() to it within a for loop. But as before, there is a more succinct way of achieving the same goal. This time, we will split the new string sent on whitespace:
>>> sent.split(' ')
['colorless', 'green', 'ideas', 'sleep', 'furiously']
>>>
To consolidate your understanding of joining and splitting strings, let's try the same thing using a semicolon as the separator:
>>> sent = ';'.join(chomsky)
>>> sent
'colorless;green;ideas;sleep;furiously'
>>> sent.split(';')
['colorless', 'green', 'ideas', 'sleep', 'furiously']
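Joining and splitting with the same separator are inverse operations as long as the separator does not occur inside any of the words. The following sketch, assuming Python 3, checks this for a few separators; the names glue and round_trips, and the third separator ', ', are illustrative choices.

```python
chomsky = ['colorless', 'green', 'ideas', 'sleep', 'furiously']

round_trips = []
for glue in [' ', ';', ', ']:
    joined = glue.join(chomsky)       # list -> string
    restored = joined.split(glue)     # string -> list, using the same separator
    round_trips.append(restored == chomsky)

print(round_trips)
```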
2.4.6 Exercises
1. ☼ Using the Python interactive interpreter, experiment with the examples in this section. Think of a sentence and represent it as a list of strings, e.g. ['Hello', 'world']. Try the various operations for indexing, slicing and sorting the elements of your list. Extract individual items (strings), and perform some of the string operations on them.
2. ☼ Split sent on some other character, such as 's'.
3. ☼ We pointed out that when phrase is a list, phrase.reverse() modifies phrase in place rather than returning a new list. On the other hand, we can use the slice trick mentioned in the exercises for the previous section, [::-1], to create a new reversed list without changing phrase. Show how you can confirm this difference in behavior.
4. ☼ We have seen how to represent a sentence as a list of words, where each word is a sequence of characters. What does phrase1[2][2] do? Why? Experiment with other index values.
5. ☼ Write a for loop to print out the characters of a string, one per line.
6. ☼ What happens if you call split on a string, with no argument, e.g. sent.split()? What happens when the string being split contains tab characters, consecutive space characters, or a sequence of tabs and spaces? (In IDLE you will need to use '\t' to enter a tab character.)
7. ☼ Create a variable words containing a list of words. Experiment with words.sort() and sorted(words). What is the difference?
8. ◑ Process the list chomsky using a for loop, and store the result in a new list lengths. Hint: begin by assigning the empty list to lengths, using lengths = []. Then each time through the loop, use append() to add another length value to the list.
9. ◑ Define a variable silly to contain the string: 'newly formed bland ideas are inexpressible in an infuriating way'. (This happens to be the legitimate interpretation that bilingual English-Spanish speakers can assign to Chomsky's famous phrase, according to Wikipedia). Now write code to perform the following tasks:
1. Split silly into a list of strings, one per word, using Python's split() operation, and save this to a variable called bland.
2. Extract the second letter of each word in silly and join them into a string, to get 'eoldrnnnna'.
3. Combine the words in bland back into a single string, using join(). Make sure the words in the resulting string are separated with whitespace.
4. Print the words of silly in alphabetical order, one per line.
10. ◑ The index() function can be used to look up items in sequences. For example, 'inexpressible'.index('e') tells us the index of the first position of the letter e.
1. What happens when you look up a substring, e.g. 'inexpressible'.index('re')?
2. Define a variable words containing a list of words. Now use words.index() to look up the position of an individual word.
3. Define a variable silly as in the exercise above. Use the index() function in combination with list slicing to build a list phrase consisting of all the words up to (but not including) 'in' in silly.
2.5 Making Decisions
So far, our simple programs have been able to manipulate sequences of words, and perform some operation on each one. We applied this to lists consisting of a few words, but the approach works the same for lists of arbitrary size, containing thousands of items. Thus, such programs have some interesting qualities: (i) the ability to work with language, and (ii) the potential to save human effort through automation. Another useful feature of programs is their ability to make decisions on our behalf; this is our focus in this section.
2.5.1 Making Simple Decisions
Most programming languages permit us to execute a block of code when a conditional expression, or if statement, is satisfied. In the following program, we have created a variable called word containing the string value 'cat'. The if statement then checks whether the condition len(word) < 5 is true. Because the conditional expression is true, the body of the if statement is invoked and the print statement is executed.
>>> word = "cat"
>>> if len(word) < 5:
...     print 'word length is less than 5'
...
word length is less than 5
>>>
If we change the conditional expression to len(word) >= 5, to check that the length of word is greater than or equal to 5, then the conditional expression will no longer be true, and the body of the if statement will not be run:
>>> if len(word) >= 5:
...     print 'word length is greater than or equal to 5'
...
>>>
The if statement, just like the for statement above, is a control structure: it controls whether the code in its body will be run. You will notice that both if and for have a colon at the end of the line, before the indentation begins. That's because all Python control structures end with a colon.
What if we want to do something when the conditional expression is not true? The answer is to add an else clause to the if statement:
>>> if len(word) >= 5:
...     print 'word length is greater than or equal to 5'
... else:
...     print 'word length is less than 5'
...
word length is less than 5
>>>
Finally, if we want to test multiple conditions in one go, we can use an elif clause that acts like an else and an if combined:
>>> if len(word) < 3:
...     print 'word length is less than three'
... elif len(word) == 3:
...     print 'word length is equal to three'
... else:
...     print 'word length is greater than three'
...
word length is equal to three
>>>
2.5.2 Conditional Expressions
Python supports a wide range of operators like < and >= for testing the relationship between values. The full set of these relational operators is shown in Table 2.1.
Table 2.1: Conditional Expressions

Operator    Relationship
<           less than
<=          less than or equal to
==          equal to (note this is two = signs, not one)
!=          not equal to
>           greater than
>=          greater than or equal to
Normally we use conditional expressions as part of an if statement. However, we can test these relational operators directly at the prompt:
>>> 3 < 5
True
>>> 5 < 3
False
>>> not 5 < 3
True
>>>
Here we see that these expressions have Boolean values, namely True or False. not is a Boolean operator, and flips the truth value of a Boolean expression.
Strings and lists also support conditional operators:
>>> word = 'sovereignty'
>>> 'sovereign' in word
True
>>> 'gnt' in word
True
>>> 'pre' not in word
True
>>> 'Hello' in ['Hello', 'World']
True
>>> 'Hell' in ['Hello', 'World']
False
>>>
Strings also have methods for testing what appears at the beginning and the end of a string (as opposed to just anywhere in the string):
>>> word.startswith('sovereign')
True
>>> word.endswith('ty')
True
>>>
Note
Integers, strings and lists are all kinds of data types in Python. In fact, every value in Python has a type. The type determines what operations you can perform on the data value. So, for example, we have seen that we can index strings and lists, but we can't index integers:
>>> one = 'cat'
>>> one[0]
'c'
>>> two = [1, 2, 3]
>>> two[1]
2
>>> three = 1234
>>> three[2]
Traceback (most recent call last):
  File "<pyshell#95>", line 1, in -toplevel-
    three[2]
TypeError: 'int' object is unsubscriptable
>>>
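The link between a value's type and the operations it supports can also be checked with the built-in type() function. The following is a minimal sketch, assuming Python 3 (where print is a function, unlike the Python 2 sessions in this chapter). The helper name can_index is our own, and it assumes the values tested are non-empty.

```python
def can_index(value):
    """Return True if value[0] works, False if the type does not support indexing.

    Hypothetical helper for illustration; assumes value is non-empty.
    """
    try:
        value[0]
        return True
    except TypeError:
        return False

one = 'cat'
two = [1, 2, 3]
three = 1234

for value in (one, two, three):
    # type(value).__name__ is the name of the value's type, e.g. 'str'
    print(type(value).__name__, can_index(value))
```

Catching the TypeError lets the program carry on instead of stopping with a traceback.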
2.5.3 Iteration, Items, and if
Now it is time to put some of the pieces together. We are going to take the string 'how now brown cow' and print out all of the words ending in 'ow'. Let's build the program up in stages. The first step is to split the string into a list of words:
>>> sentence = 'how now brown cow'
>>> words = sentence.split()
>>> words
['how', 'now', 'brown', 'cow']
>>>
Next, we need to iterate over the words in the list. Just so we don't get ahead of ourselves, let's print each word, one per line:
>>> for word in words:
...     print word
...
how
now
brown
cow
The next stage is to only print out the words if they end in the string 'ow'. Let's check that we know how to do this first:
>>> 'how'.endswith('ow')
True
>>> 'brown'.endswith('ow')
False
>>>
Now we are ready to put an if statement inside the for loop. Here is the complete program:
>>> sentence = 'how now brown cow'
>>> words = sentence.split()
>>> for word in words:
...     if word.endswith('ow'):
...         print word
...
how
now
cow
>>>
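A small variation on this program stores the matching words in a list rather than printing them, and uses an else clause to report the words that fail the test. This is only a sketch, written for Python 3 (an assumption; the sessions above use Python 2's print statement), and the variable name matches is our own.

```python
sentence = 'how now brown cow'
matches = []                       # words that pass the test

for word in sentence.split():
    if word.endswith('ow'):
        matches.append(word)       # keep the word for later use
    else:
        print('skipping', word)    # report words that fail the test

print(matches)
```

Collecting results in a list means a later step of the program can reuse them.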
As you can see, even with this small amount of Python knowledge it is possible to develop useful programs. The key idea is to develop the program in pieces, testing that each one does what you expect, and then combining them to produce whole programs. This is why the Python interactive interpreter is so invaluable, and why you should get comfortable using it.
2.5.4 Exercises
1. ☼ Assign a new value to sentence, namely the string 'she sells sea shells by the sea shore', then write code to perform the following tasks:
1. Print all words beginning with 'sh'.
2. Print all words longer than 4 characters.
3. Generate a new sentence that adds the popular hedge word 'like' before every word beginning with 'se'. Your result should be a single string.
2. ☼ Write code to abbreviate text by removing all the vowels. Define sentence to hold any string you like, then initialize a new string result to hold the empty string ''. Now write a for loop to process the string, one character at a time, and append any non-vowel characters to the result string.
3. ◑ Write conditional expressions, such as 'H' in msg, but applied to lists instead of strings. Check whether particular words are included in the Chomsky nonsense sentence.
4. ◑ Write code to convert text into hAck3r, where characters are mapped according to the following table:
Table 2.2

Input:    e    i    o    l    s    .         ate
Output:   3    1    0    |    5    5w33t!    8
2.6 Getting Organized
Strings and lists are a simple way to organize data. In particular, they map from integers to values. We can "look up" a character in a string using an integer, and we can look up a word in a list of words using an integer. These cases are shown in Figure 2.3.
Figure 2.3: Sequence Look-up
However, we need a more flexible way to organize and access our data. Consider the examples in Figure 2.4.
Figure 2.4: Dictionary Look-up
In the case of a phone book, we look up an entry using a name, and get back a number. When we type a domain name in a web browser, the computer looks this up to get back an IP address. A word frequency table allows us to look up a word and find its frequency in a text collection. In all these cases, we are mapping from names to numbers, rather than the other way round as with indexing into sequences. In general, we would like to be able to map between arbitrary types of information. Table 2.3 lists a variety of linguistic objects, along with what they map.
Table 2.3: Linguistic Objects as Mappings from Keys to Values

Linguistic Object       Maps from       Maps to
Document Index          Word            List of pages (where word is found)
Thesaurus               Word sense      List of synonyms
Dictionary              Headword        Entry (part of speech, sense definitions, etymology)
Comparative Wordlist    Gloss term      Cognates (list of words, one per language)
Morph Analyzer          Surface form    Morphological analysis (list of component morphemes)
Most often, we are mapping from a string to some structured object. For example, a document index maps from a word (which we can represent as a string), to a list of pages (represented as a list of integers). In this section, we will see how to represent such mappings in Python.
2.6.1 Accessing Data with Data
Python provides a dictionary data type that can be used for mapping between arbitrary types.
Note
A Python dictionary is somewhat like a linguistic dictionary — they both give you a systematic means of looking things up, and so there is some potential for confusion. However, we hope that it will usually be clear from the context which kind of dictionary we are talking about.
Here we define pos to be an empty dictionary and then add three entries to it, specifying the part-of-speech of some words. We add entries to a dictionary using the familiar square bracket notation:
>>> pos = {}
>>> pos['colorless'] = 'adj'
>>> pos['furiously'] = 'adv'
>>> pos['ideas'] = 'n'
>>>
So, for example, pos['colorless'] = 'adj' says that the look-up value of 'colorless' in pos is the string 'adj'.
To look up a value in pos, we again use indexing notation, except now the thing inside the square brackets is the item whose value we want to recover:
>>> pos['ideas']
'n'
>>> pos['colorless']
'adj'
>>>
The item used for look-up is called the key, and the data that is returned is known as the value. As with indexing a list or string, we get an exception when we try to access the value of a key that does not exist:
>>> pos['missing']
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
KeyError: 'missing'
>>>
This raises an important question. With lists and strings, we can use len() to work out which integers will be legal indices. But how do we work out the legal keys for a dictionary? Fortunately, we can check whether a key exists in a dictionary using the in operator:
>>> 'colorless' in pos
True
>>> 'missing' in pos
False
>>> 'missing' not in pos
True
>>>
Notice that we can use not in to check if a key is missing. Be careful with the in operator for dictionaries: it only applies to the keys and not their values. If we check for a value, e.g. 'adj' in pos, the result is False, since 'adj' is not a key.
We can loop over all the entries in a dictionary using a for loop:
>>> for word in pos:
...     print "%s (%s)" % (word, pos[word])
...
colorless (adj)
furiously (adv)
ideas (n)
>>>
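The caution that in tests only the keys can be verified directly. The sketch below assumes Python 3 (print as a function). It also uses the dictionary method get(), which is standard Python but is not otherwise covered in this section, as one way to look up a key that may be absent without triggering a KeyError.

```python
pos = {}
pos['colorless'] = 'adj'
pos['furiously'] = 'adv'
pos['ideas'] = 'n'

print('adj' in pos)                    # False: 'adj' is a value, not a key
print('adj' in pos.values())           # True: search the values explicitly
print(pos.get('missing', 'unknown'))   # 'unknown' is returned instead of a KeyError

tag = pos.get('ideas', 'unknown')      # an existing key gives its stored value
print(tag)
```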
We can see what the contents of the dictionary look like by inspecting the variable pos. Note the presence of the colon character to separate each key from its corresponding value:
>>> pos
{'furiously': 'adv', 'ideas': 'n', 'colorless': 'adj'}
>>>
Here, the contents of the dictionary are shown as key-value pairs. As you can see, the order of the key-value pairs is different from the order in which they were originally entered. This is because dictionaries are not sequences but mappings. The keys in a mapping are not inherently ordered, and any ordering that we might want to impose on the keys exists independently of the mapping. As we shall see later, this gives us a lot of flexibility.
We can use the same key-value pair format to create a dictionary:
>>> pos = {'furiously': 'adv', 'ideas': 'n', 'colorless': 'adj'}
>>>
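Since any ordering exists independently of the mapping, we can impose whichever ordering suits the task at the moment we loop over the keys. The following sketch assumes Python 3 and uses the key and reverse arguments of sorted(), which are standard but have not been introduced in this chapter. Note that dictionaries in Python 3.7 and later do remember insertion order, so the display shown above reflects the older interpreter used here; sorting explicitly remains the dependable way to get a particular order.

```python
pos = {'furiously': 'adv', 'ideas': 'n', 'colorless': 'adj'}

# Alphabetical order of keys, imposed at the time we loop
for word in sorted(pos):
    print(word, pos[word])

# Longest word first: the same mapping, a different ordering
by_length = sorted(pos, key=len, reverse=True)
print(by_length)
```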
Using the dictionary methods keys(), values() and items(), we can access the keys and values as separate lists, and also the key-value pairs:
>>> pos.keys()
['colorless', 'furiously', 'ideas']
>>> pos.values()
['adj', 'adv', 'n']
>>> pos.items()
[('colorless', 'adj'), ('furiously', 'adv'), ('ideas', 'n')]
>>> for (key, val) in pos.items():
...     print "%s ==> %s" % (key, val)
...
colorless ==> adj
furiously ==> adv
ideas ==> n
>>>
Note that keys are forced to be unique. Suppose we try to use a dictionary to store the fact that the word content is both a noun and a verb:
>>> pos['content'] = 'n'
>>> pos['content'] = 'v'
>>> pos
{'content': 'v', 'furiously': 'adv', 'ideas': 'n', 'colorless': 'adj'}
>>>
Initially, pos['content'] is given the value 'n', and this is immediately overwritten with the new value 'v'. In other words, there is only one entry for 'content'. If we wanted to store multiple values in that entry, we could use a list, e.g. pos['content'] = ['n', 'v'].
2.6.2 Counting with Dictionaries
The values stored in a dictionary can be any kind of object, not just a string — the values can even be dictionaries. The most common kind is actually an integer. It turns out that we can use a dictionary to store counters for many kinds of data. For instance, we can have a counter for all the letters of the alphabet; each time we get a certain letter we increment its corresponding counter:
>>> phrase = 'colorless green ideas sleep furiously'
>>> count = {}
>>> for letter in phrase:
...     if letter not in count:
...         count[letter] = 0
...     count[letter] += 1
...
>>> count
{'a': 1, ' ': 4, 'c': 1, 'e': 6, 'd': 1, 'g': 1, 'f': 1, 'i': 2,
'l': 4, 'o': 3, 'n': 1, 'p': 1, 's': 5, 'r': 3, 'u': 2, 'y': 1}
>>>
Observe that in is used here in two different ways: for letter in phrase iterates over every letter, running the body of the for loop. Inside this loop, the conditional expression if letter not in count checks whether the letter is missing from the dictionary. If it is missing, we create a new entry and set its value to zero: count[letter] = 0. Now we are sure that the entry exists, and it may have a zero or non-zero value. We finish the body of the for loop by incrementing this particular counter using the += assignment operator. Finally, we print the dictionary, to see the letters and their counts. This method of maintaining many counters will find many uses, and you will become very familiar with it.
To make counting much easier, we can use defaultdict, a special kind of container introduced in Python 2.5 that is included in NLTK for the benefit of readers who are using Python 2.4 (we will explain the import statement later):
>>> phrase = 'colorless green ideas sleep furiously'
>>> from nltk import defaultdict
>>> count = defaultdict(int)
>>> for letter in phrase:
...     count[letter] += 1
...
>>> count
{'a': 1, ' ': 4, 'c': 1, 'e': 6, 'd': 1, 'g': 1, 'f': 1, 'i': 2,
'l': 4, 'o': 3, 'n': 1, 'p': 1, 's': 5, 'r': 3, 'u': 2, 'y': 1}
>>>
Note
Calling defaultdict(int) creates a special kind of dictionary. When that dictionary is accessed with a non-existent key — i.e. the first time a particular letter is encountered — then int() is called to produce the initial value for this key (i.e. 0). You can test this by running the above code, then typing count['X'] and seeing that it returns a zero value (and not a KeyError as in the case of normal Python dictionaries). The function defaultdict is very handy and will be used in many places later on.
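The test suggested in this note can be written out as follows. This is a sketch for Python 3, and it imports defaultdict from the standard collections module (its home since Python 2.5) rather than from NLTK, which is an assumption about the reader's setup.

```python
from collections import defaultdict   # standard-library location of defaultdict

phrase = 'colorless green ideas sleep furiously'
count = defaultdict(int)
for letter in phrase:
    count[letter] += 1

print(count['e'])         # a letter that does occur in phrase
print('X' in count)       # False: 'X' has not been looked up yet
print(count['X'])         # 0: int() supplies the initial value
print('X' in count)       # True: the lookup itself created the entry
```

The last line shows a side effect worth knowing about: merely reading a missing key adds it to the dictionary.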
There are other useful ways to display the result, such as sorting alphabetically by the letter:
>>> sorted(count.items())
[(' ', 4), ('a', 1), ('c', 1), ('d', 1), ('e', 6), ('f', 1), ...,
...('y', 1)]
>>>
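Another display that is often wanted is the letters ranked by how frequently they occur. One way to sketch this, assuming Python 3 and using only operations already seen, is to build (count, letter) pairs so that sorting compares the counts first. The names pairs and ranked are our own.

```python
phrase = 'colorless green ideas sleep furiously'
count = {}
for letter in phrase:
    if letter not in count:
        count[letter] = 0
    count[letter] += 1

# Put the count first in each pair so that sorting orders by count
pairs = []
for letter in count:
    pairs.append((count[letter], letter))

ranked = sorted(pairs, reverse=True)   # largest counts first
print(ranked[:3])
```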
Note
The function sorted() is similar to the sort() method on sequences, but rather than sorting in-place, it produces a new sorted copy of its argument. Moreover, as we will see very soon, sorted() will work on a wider variety of data types, including dictionaries.
2.6.3 Getting Unique Entries
Sometimes, we don't want to count at all, but just want to make a record of the items that we have seen, regardless of repeats. For example, we might want to compile a vocabulary from a document. This is a sorted list of the words that appeared, regardless of frequency. At this stage we have two ways to do this. The first uses lists.
>>> sentence = "she sells sea shells by the sea shore".split()
>>> words = []
>>> for word in sentence:
...     if word not in words:
...         words.append(word)
...
>>> sorted(words)
['by', 'sea', 'sells', 'she', 'shells', 'shore', 'the']
>>>
There is a better way to do this task using Python's set data type. We can convert sentence into a set, using set(sentence):
>>> set(sentence)
set(['shells', 'sells', 'shore', 'she', 'sea', 'the', 'by'])
>>>
The order of items in a set is not significant, and they will usually appear in a different order to the one they were entered in. The main point here is that converting a list to a set removes any duplicates. We convert it back into a list, sort it, and print. Here is the complete program:
>>> sentence = "she sells sea shells by the sea shore".split()
>>> sorted(set(sentence))
['by', 'sea', 'sells', 'she', 'shells', 'shore', 'the']
Here we have seen that there is sometimes more than one way to solve a problem with a program. In this case, we used three different built-in data types, a list, a dictionary, and a set. The set data type most closely modeled our task, so it required the least amount of work.
2.6.4 Scaling Up
We can use dictionaries to count word occurrences. For example, the following code uses NLTK's corpus reader to load Macbeth and count the frequency of each word. Before we can use NLTK we need to tell Python to load it, using a special statement import nltk.
>>> import nltk
>>> count = nltk.defaultdict(int) # initialize a dictionary
>>> for word in nltk.corpus.gutenberg.words('shakespeare-macbeth'): # tokenize Macbeth
...     word = word.lower() # normalize to lowercase
...     count[word] += 1 # increment the counter
...
>>>
You will learn more about accessing corpora in Section 3.2.3. For now, you just need to know that gutenberg.words() returns a list of words, in this case from Shakespeare's play Macbeth, and we are iterating over this list using a for loop. We convert each word to lowercase using the string method word.lower(), and use a dictionary to maintain a set of counters, one per word. Now we can inspect the contents of the dictionary to get counts for particular words:
>>> count['scotland']
12
>>> count['the']
692
>>>
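The effect of the lowercasing step can be seen without loading a corpus. The sketch below assumes Python 3, the standard collections module, and a short made-up sample text in place of Macbeth, so its counts have nothing to do with the figures above.

```python
from collections import defaultdict

text = "The cat saw the dog and the dog saw The cat"   # sample text (our own)

raw = defaultdict(int)     # counts of words exactly as written
norm = defaultdict(int)    # counts after normalizing to lowercase
for word in text.split():
    raw[word] += 1
    norm[word.lower()] += 1

print(raw['the'], raw['The'])   # the two spellings are counted separately
print(norm['the'])              # after lowercasing they share one counter
```

Without normalization, sentence-initial capitals split the count for a single word across two entries.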
2.6.5 Exercises
1. ☼ Using the Python interpreter in interactive mode, experiment with the examples in this section. Create a dictionary d, and add some entries. What happens if you try to access a non-existent entry, e.g. d['xyz']?
2. ☼ Try deleting an element from a dictionary, using the syntax del d['abc']. Check that the item was deleted.
3. ☼ Create a dictionary e, to represent a single lexical entry for some word of your choice. Define keys like headword, part-of-speech, sense, and example, and assign them suitable values.
4. ☼ Create two dictionaries, d1 and d2, and add some entries to each. Now issue the command d1.update(d2). What did this do? What might it be useful for?
5. ◑ Write a program that takes a sentence expressed as a single string, splits it and counts up the words. Get it to print out each word and the word's frequency, one per line, in alphabetical order.
2.7 Regular Expressions
For a moment, imagine that you are editing a large text, and you have a strong dislike of repeated occurrences of the word very. How could you find all such cases in the text? To be concrete, let's suppose that we assign the following text to the variable s:
>>> s = """Google Analytics is very very very nice (now)
... By Jason Hoffman 18 August 06
... Google Analytics, the result of Google's acquisition of the San
... Diego-based Urchin Software Corporation, really really opened its
... doors to the world a couple of days ago, and it allows you to
... track up to 10 sites within a single google account.
... """
>>>
Python's triple quotes """ are used here since they allow us to break a string across lines.
One approach to our task would be to convert the string into a list, and look for adjacent items that are both equal to the string 'very'. We use the range(n) function in this example to create a list of consecutive integers from 0 up to, but not including, n:
>>> text = s.split()
>>> for n in range(len(text)):
...     if text[n] == 'very' and text[n+1] == 'very':
...         print n, n+1
...
3 4
4 5
>>>
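One fragile point in the loop above is that text[n+1] would raise an IndexError if the very last word of the text happened to be 'very'. A slightly more defensive sketch, assuming Python 3 and a shortened version of the sample text, stops one position early and reports any word that is immediately repeated. The name pairs is our own.

```python
s = "Google Analytics is very very very nice (now)"
text = s.split()

pairs = []
# Stop one short of the end so that text[n+1] always exists
for n in range(len(text) - 1):
    if text[n] == text[n+1]:
        pairs.append((n, n+1, text[n]))

print(pairs)
```

Even so, this remains a word-by-word search that needs new code for every new kind of pattern, which is the limitation regular expressions address.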
However, such an approach is not very flexible or convenient. In this section, we will present Python's regular expression module re, which supports powerful search and substitution inside strings. As a gentle introduction, we will start out using a utility function re_show() to illustrate how regular expressions match against substrings. re_show() takes two arguments, a pattern that it is looking for, and a string in which the pattern might occur.
>>> import nltk
>>> nltk.re_show('very very', s)
Google Analytics is {very very} very nice (now)
...
>>>
(We have only displayed the first part of s that is returned, since the rest is irrelevant for the moment.) As you can see, re_show places curly braces around the first occurrence it has found of the string 'very very'. So an important part of what re_show is doing is searching for any substring of s that matches the pattern in its first argument.
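To get a feel for what such a utility does, here is a minimal sketch of a similar helper built on Python's re.sub(). It assumes Python 3, the name show_matches is hypothetical, and it is not NLTK's implementation, which may differ in detail (for instance, this version returns the marked-up string rather than printing it).

```python
import re

def show_matches(pattern, string):
    """Return string with each match of pattern wrapped in curly braces.

    Hypothetical stand-in for illustration; not NLTK's re_show().
    """
    # \g<0> in the replacement stands for the whole matched substring
    return re.sub(pattern, r'{\g<0>}', string)

line = "Google Analytics is very very very nice (now)"
print(show_matches('very very', line))
```

Because matches cannot overlap, the third 'very' is left unmarked once the first two have been consumed.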
Now we might want to modify the example so that re_show highlights cases where there are two or more adjacent sequences of 'very'. To do this, we need to use a regular expression operator, namely '+'. If s is a string, then s+ means: 'match one or more occurrences of s'. Let's first look at the case where s is a single character, namely the letter 'o':
>>> nltk.re_show('o+', s)
G{oo}gle Analytics is very very very nice (n{o}w)
...
>>>
'o+' is our first proper regular expression. You can think of it as matching an infinite set of strings, namely the set {'o', 'oo', 'ooo', ...}. But we would really like to match sequences of at least two 'o's; for this, we need the regular expression 'oo+', which matches any string consisting of 'o' followed by one or more occurrences of 'o'.
>>> nltk.re_show('oo+', s)
G{oo}gle Analytics is very very very nice (now)
...
>>>
Let's return to the task of identifying multiple occurrences of 'very'. Some initially plausible candidates won't do what we want. For example, 'very+' would match 'veryyy' (but not 'very very'), since the + scopes over the immediately preceding expression, in this case 'y'. To widen the scope of +, we need to use parentheses, as in '(very)+'. Will this match 'very very'? No, because we've forgotten about the whitespace between the two words; instead, it will match strings like 'veryvery'. However, the following does work:
>>> nltk.re_show('(very\s)+', s)
Google Analytics is {very very very }nice (now)
>>>
Characters preceded by a \, such as '\s', have a special interpretation inside regular expressions; thus, '\s' matches a whitespace character. We could have used ' ' in our pattern, but '\s' is better practice in general. One reason is that the sense of "whitespace" we are using is more general than you might have imagined; it includes not just inter-word spaces, but also tabs and newlines. If you try to inspect the variable s, you might initially get a shock:
>>> s
"Google Analytics is very very very nice (now)\nBy Jason Hoffman
18 August 06\nGoogle
...
>>>
You might recall that '\n' is a special character that corresponds to a newline in a string. The following example shows how newline is matched by '\s'.
>>> s2 = "I'm very very\nvery happy"
>>> nltk.re_show('very\s', s2)
I'm {very }{very
}{very }happy
>>>
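The same comparison can be made with re.findall(), which is introduced just below. This sketch assumes Python 3 and writes the pattern as a raw string (the r prefix), a notation explained later in this section.

```python
import re

s2 = "I'm very very\nvery happy"

with_class = re.findall(r'very\s', s2)   # \s matches the space and the newline
with_space = re.findall('very ', s2)     # a literal space misses the newline

print(with_class)
print(with_space)
```

The literal space finds only two of the three occurrences, which is one reason '\s' is the better habit.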
Python's re.findall(patt, s) function is a useful way to find all the substrings in s that are matched by patt. Before illustrating, let's introduce two further special characters, '\d' and '\w': the first will match any digit, and the second will match any alphanumeric character. Before we can use re.findall() we have to load Python's regular expression module, using import re.
>>> import re
>>> re.findall('\d\d', s)
['18', '06', '10']
>>> re.findall('\s\w\w\w\s', s)
[' the ', ' the ', ' its\n', ' the ', ' and ', ' you ']
>>>
As you can see, the second example matches three-letter words. However, this regular expression is not quite what we want. First, the leading and trailing spaces are extraneous. Second, it will fail to match against strings such as 'the San', where two three-letter words are adjacent. To solve this problem, we can use another special character, namely '\b'. This is sometimes called a "zero-width" character; it matches against the empty string, but only at the beginning and end of words:
>>> re.findall(r'\b\w\w\w\b', s)
['now', 'the', 'the', 'San', 'its', 'the', 'ago', 'and', 'you']
Note
This example uses a Python raw string: r'\b\w\w\w\b'. The specific justification here is that in an ordinary string, \b is interpreted as a backspace character. Python will convert it to a backspace in a regular expression unless you use the r prefix to create a raw string as shown above. Another use for raw strings is to match strings that include backslashes. Suppose we want to match 'either\or'. In order to create a regular expression, the backslash needs to be escaped, since it is a special character; so we want to pass the pattern \\ to the regular expression interpreter. But to express this as a Python string literal, each backslash must be escaped again, yielding the string '\\\\'. However, with a raw string, this reduces down to r'\\'.
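The claims in this note can be checked by measuring the strings involved. The following sketch assumes Python 3; the variable names are our own.

```python
import re

plain = '\b'     # ordinary string: a single backspace character
raw = r'\b'      # raw string: a backslash followed by the letter b
print(len(plain), len(raw))

escaped = '\\\\'       # ordinary string literal for the pattern \\
raw_pattern = r'\\'    # the same two-character pattern as a raw string
print(escaped == raw_pattern)

text = r'either\or'    # a string containing one backslash
match = re.search(raw_pattern, text)
print(match.start())   # index of the backslash within text
```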
Returning to the case of repeated words, we might want to look for cases involving 'very' or 'really', and for this we use the disjunction operator |.
>>> nltk.re_show('((very|really)\s)+', s)
Google Analytics is {very very very }nice (now)
By Jason Hoffman 18 August 06
Google Analytics, the result of Google's acquisition of the San
Diego-based Urchin Software Corporation, {really really }opened its
doors to the world a couple of days ago, and it allows you to
track up to 10 sites within a single google account.
>>>
In addition to the matches just illustrated, the regular expression '((very|really)\s)+' will also match cases where the two disjuncts occur with each other, such as the string 'really very really '.
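This mixed case is easy to confirm with re.search(), a standard function of the re module that returns the first match it finds. The sketch assumes Python 3 and a made-up test sentence.

```python
import re

pattern = r'((very|really)\s)+'
m = re.search(pattern, 'that was really very really good')

# group(0) is the whole stretch of text matched by the pattern
whole = m.group(0)
print(repr(whole))
```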
Let's now look at how to perform substitutions, using the re.sub() function. In the first instance we replace all instances of l with s. Note that this generates a string as output, and doesn't modify the original string. Then we replace any instances of green with red.
>>> sent = "colorless green ideas sleep furiously"
>>> re.sub('l', 's', sent)
'cosorsess green ideas sseep furioussy'
>>> re.sub('green', 'red', sent)
'colorless red ideas sleep furiously'
>>>
We can also disjoin individual characters using a square bracket notation. For example, [aeiou] matches any of a, e, i, o, or u, that is, any vowel. The expression [^aeiou] matches any single character that is not a vowel. In the following example, we match sequences consisting of a non-vowel followed by a vowel.
>>> nltk.re_show('[^aeiou][aeiou]', sent)
{co}{lo}r{le}ss g{re}en{ i}{de}as s{le}ep {fu}{ri}ously
>>>
Using the same regular expression, the function re.findall() returns a list of all the substrings in sent that are matched:
>>> re.findall('[^aeiou][aeiou]', sent)
['co', 'lo', 'le', 're', ' i', 'de', 'le', 'fu', 'ri']
>>>
2.7.1 Groupings
Returning briefly to our earlier problem with unwanted whitespace around three-letter words, we note that re.findall() behaves slightly differently if we create groups in the regular expression using parentheses; it only returns strings that occur within the groups:
>>> re.findall('\s(\w\w\w)\s', s)
['the', 'the', 'its', 'the', 'and', 'you']
>>>
The same device allows us to select only the non-vowel characters that appear before a vowel:
>>> re.findall('([^aeiou])[aeiou]', sent)
['c', 'l', 'l', 'r', ' ', 'd', 'l', 'f', 'r']
>>>
By delimiting a second group in the regular expression, we can even generate pairs (or tuples) that we may then go on and tabulate.
>>> re.findall('([^aeiou])([aeiou])', sent)
[('c', 'o'), ('l', 'o'), ('l', 'e'), ('r', 'e'), (' ', 'i'),
('d', 'e'), ('l', 'e'), ('f', 'u'), ('r', 'i')]
>>>
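As a sketch of the tabulation step, the pairs can be counted with the dictionary technique from the previous section, since tuples can serve as dictionary keys. This assumes Python 3; the name table is our own.

```python
import re

sent = "colorless green ideas sleep furiously"
pairs = re.findall(r'([^aeiou])([aeiou])', sent)

table = {}
for pair in pairs:
    if pair not in table:
        table[pair] = 0
    table[pair] += 1

# Print each (non-vowel, vowel) pair with the number of times it occurred
for pair in sorted(table):
    print(pair, table[pair])
```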
Our next example also makes use of groups. One further special character is the so-called wildcard element, '.'; this has the distinction of matching any single character (except '\n'). Given the string s3, our task is to pick out login names and email domains:
>>> s3 = """
... <hart@vmd.cso.uiuc.edu>
... Final editing was done by Martin Ward <Martin.Ward@uk.ac.durham>
... Michael S. Hart <hart@pobox.com>
... Prepared by David Price, email <ccx074@coventry.ac.uk>"""
The task is made much easier by the fact that all the email addresses in the example are delimited by angle brackets, and we can exploit this feature in our regular expression:
>>> re.findall(r'<(.+)@(.+)>', s3)
[('hart', 'vmd.cso.uiuc.edu'), ('Martin.Ward', 'uk.ac.durham'),
('hart', 'pobox.com'), ('ccx074', 'coventry.ac.uk')]
>>>
Since '.' matches any single character, '.+' will match any non-empty string of characters, including punctuation symbols such as the period.
One question that might occur to you is how do we specify a match against a period? The answer is that we have to place a '\' immediately before the '.' in order to escape its special interpretation.
>>> re.findall(r'(\w+\.)', s3)
['vmd.', 'cso.', 'uiuc.', 'Martin.', 'uk.', 'ac.', 'S.',
'pobox.', 'coventry.', 'ac.']
>>>
Now, let's suppose that we wanted to match occurrences of both 'Google' and 'google' in our sample text. If you have been following up till now, you would reasonably expect that this regular expression with a disjunction would do the trick: '(G|g)oogle'. But look what happens when we try this with re.findall():
>>> re.findall('(G|g)oogle', s)
['G', 'G', 'G', 'g']
>>>
What is going wrong? We innocently used the parentheses to indicate the scope of the operator '|', but re.findall() has interpreted them as marking a group. In order to tell re.findall() "don't try to do anything special with these parentheses", we need an extra piece of notation:
>>> re.findall('(?:G|g)oogle', s)
['Google', 'Google', 'Google', 'google']
>>>
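The difference between the two kinds of parentheses can be checked side by side on a short string. This sketch assumes Python 3 and a made-up test string; it also shows that for a choice between single characters, the square bracket notation seen earlier avoids parentheses altogether.

```python
import re

text = "Google or google"

captured = re.findall(r'(G|g)oogle', text)      # capturing group: only the group is returned
scoped = re.findall(r'(?:G|g)oogle', text)      # non-capturing group: the whole match is returned
bracketed = re.findall(r'[Gg]oogle', text)      # character class: no parentheses needed

print(captured)
print(scoped)
print(bracketed)
```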
Placing '?:' immediately after the opening parenthesis makes it explicit that the parentheses are just being used for scoping.
2.7.2 Practice Makes Perfect
Regular expressions are very flexible and very powerful. However, they often don't do what you expect. For this reason, you are strongly encouraged to try out a variety of tasks using re_show() and re.findall() in order to develop your intuitions further; the exercises below should help get you started. We suggest that you build up a regular expression in small pieces, rather than trying to get it completely right first time. Here are some operators and sequences that are commonly used in natural language processing.
Table 2.4: Commonly-used Operators and Sequences

*       Zero or more, e.g. a*, [a-z]*
+       One or more, e.g. a+, [a-z]+
?       Optional, e.g. a?, [a-z]?
[..]    A set or range of characters, e.g. [aeiou], [a-z0-9]
(..)    Grouping parentheses, e.g. (the|a|an)?
\b      Word boundary (zero width)
\d      Any decimal digit (\D is any non-digit)
\s      Any whitespace character (\S is any non-whitespace character)
\w      Any alphanumeric character (\W is any non-alphanumeric character)
\t      The tab character
\n      The newline character
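A few of these operators can be tried out on small inputs before tackling the exercises. The sketch below assumes Python 3, and the sample strings are made up for illustration.

```python
import re

text = "Flight 42 left at 9.30 with 180 passengers"   # made-up sample text

numbers = re.findall(r'\d+', text)                      # + : one or more digits
lower = re.findall(r'\b[a-z]+\b', text)                 # \b with a character range
spellings = re.findall(r'colou?r', 'color and colour')  # ? : the u is optional
runs = re.findall(r'ba*', 'b ba baa')                   # * : zero or more a's

print(numbers)
print(lower)
print(spellings)
print(runs)
```

Building up patterns on short strings like these makes it easier to see what each operator contributes.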
2.7.3 Exercises
1.
☼ Describe the class of strings matched by the following regular expressions. Note that '*' means: match zero or more occurrences of the preceding regular expression.
1. [a-zA-Z]+
2. [A-Z][a-z]*
3. \d+(\.\d+)?
4. ([bcdfghjklmnpqrstvwxyz][aeiou][bcdfghjklmnpqrstvwxyz])*
5. \w+|[^\w\s]+
Test your answers using re_show().
2.
☼ Write regular expressions to match the following classes of strings:
1. A single determiner (assume that a, an, and the are the only determiners).
2. An arithmetic expression using integers, addition, and multiplication, such as 2*3+8.
3.
◑ The above example of extracting (name, domain) pairs from text does not work when there is more than one email address on a line, because the + operator is "greedy" and consumes too much of the input.
1. Experiment with input text containing more than one email address per line, such as that shown below. What happens?
2. Using re.findall(), write another regular expression to extract email addresses, replacing the period character with a range or negated range, such as [a-z]+ or [^ >]+.
3. Now try to match email addresses by changing the regular expression .+ to its "non-greedy" counterpart, .+?
>>> s = """
... austen-emma.txt:hart@vmd.cso.uiuc.edu (internet) hart@uiucvmd (bitnet)
... austen-emma.txt:Internet (72600.2026@compuserve.com); TEL: (212-254-5093)
... austen-persuasion.txt:Editing by Martin Ward (Martin.Ward@uk.ac.durham)
... blake-songs.txt:Prepared by David Price, email ccx074@coventry.ac.uk
... """
4.
◑ Write code to convert text into Pig Latin. This involves two steps: move any consonant (or consonant cluster) that appears at the start of the word to the end, then append ay, e.g. string → ingstray, idle → idleay. http://en.wikipedia.org/wiki/Pig_Latin
5.
◑ Write code to convert text into hAck3r again, this time using regular expressions and substitution, where e → 3, i → 1, o → 0, l → |, s → 5, . → 5w33t!, ate → 8. Normalize the text to lowercase before converting it. Add more substitutions of your own. Now try to map s to two different values: $ for word-initial s, and 5 for word-internal s.
6.
★ Read the Wikipedia entry on the Soundex Algorithm. Implement this algorithm in Python.
2.8 Summary
* Text is represented in Python using strings, and we type these with single or double quotes: 'Hello', "World".
* The characters of a string are accessed using indexes, counting from zero: 'Hello World'[1] gives the value e. The length of a string is found using len().
* Substrings are accessed using slice notation: 'Hello World'[1:5] gives the value ello. If the start index is omitted, the substring begins at the start of the string; if the end index is omitted, the slice continues to the end of the string.
* Sequences of words are represented in Python using lists of strings: ['colorless', 'green', 'ideas']. We can use indexing, slicing and the len() function on lists.
* Strings can be split into lists: 'Hello World'.split() gives ['Hello', 'World']. Lists can be joined into strings: '/'.join(['Hello', 'World']) gives 'Hello/World'.
* Lists can be sorted in-place: words.sort(). To produce a separate, sorted copy, use: sorted(words).
* We process each item in a string or list using a for statement: for word in phrase. This must be followed by the colon character and an indented block of code, to be executed each time through the loop.
* We test a condition using an if statement: if len(word) < 5. This must be followed by the colon character and an indented block of code, to be executed only if the condition is true.
* A dictionary is used to map between arbitrary types of information, such as a string and a number: freq['cat'] = 12. We create dictionaries using the brace notation: pos = {}, pos = {'furiously': 'adv', 'ideas': 'n', 'colorless': 'adj'}.
* Some functions are not available by default, but must be accessed using Python's import statement.
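The points above can be pulled together in one short sketch. The word lists and the threshold of 6 are chosen only for illustration, and freq.get() is used here as one convenient way to count with a dictionary.

```python
import re  # not available by default, so it must be imported

greeting = 'Hello World'
second_char = greeting[1]     # indexes count from zero: 'e'
middle = greeting[1:5]        # slice: 'ello'
length = len(greeting)        # 11

words = greeting.split()      # ['Hello', 'World']
joined = '/'.join(words)      # 'Hello/World'

phrase = ['colorless', 'green', 'ideas', 'sleep', 'furiously']
ordered = sorted(phrase)      # a separate, sorted copy
phrase.sort()                 # sorts the original list in place

# A for statement with a nested if statement, each ending in a colon
short_words = []
for word in phrase:
    if len(word) < 6:
        short_words.append(word)

# A dictionary mapping strings to numbers
freq = {}
for animal in ['cat', 'dog', 'cat']:
    freq[animal] = freq.get(animal, 0) + 1

# Using a function from the imported module
capitalized = re.findall(r'[A-Z][a-z]*', greeting)

print(short_words)   # ['green', 'ideas', 'sleep']
print(freq['cat'])   # 2
print(capitalized)   # ['Hello', 'World']
```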
2.9 Further Reading
2.9.1 Python
Two freely available online texts are the following:
* Josh Cogliati, Non-Programmer's Tutorial for Python, http://en.wikibooks.org/wiki/Non-Programmer's_Tutorial_for_Python/Contents
* Allen B. Downey, Jeffrey Elkner and Chris Meyers, How to Think Like a Computer Scientist: Learning with Python, http://www.ibiblio.org/obp/thinkCSpy/
[Rossum & Jr., 2006] is a tutorial introduction to Python by Guido van Rossum, the inventor of Python, and Fred L. Drake, Jr., the official editor of the Python documentation. It is available online at http://docs.python.org/tut/tut.html. A more detailed but still introductory text is [Lutz & Ascher, 2003], which covers the essential features of Python and also provides an overview of the standard libraries.
[Beazley, 2006] is a succinct reference book; although not suitable as an introduction to Python, it is an excellent resource for intermediate and advanced programmers.
Finally, it is always worth checking the official Python Documentation at http://docs.python.org/.
2.9.2 Regular Expressions
There are many references for regular expressions, both practical and theoretical. [Friedl, 2002] is a comprehensive and detailed manual on using regular expressions, covering their syntax in most major programming languages, including Python.
For an introductory tutorial to using regular expressions in Python with the re module, see A. M. Kuchling, Regular Expression HOWTO, http://www.amk.ca/python/howto/regex/.
Chapter 3 of [Mertz, 2003] provides a more extended tutorial on Python's facilities for text processing with regular expressions.
http://www.regular-expressions.info/ is a useful online resource, providing a tutorial and references to tools and other sources of information.
2.9.3 Unicode
There are a number of online discussions of Unicode in general, and of Python facilities for handling Unicode. The following are worth consulting:
* Jason Orendorff, Unicode for Programmers, http://www.jorendorff.com/articles/unicode/
* A. M. Kuchling, Unicode HOWTO, http://www.amk.ca/python/howto/unicode
* Frederik Lundh, Python Unicode Objects, http://effbot.org/zone/unicode-objects.htm
* Joel Spolsky, The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), http://www.joelonsoftware.com/articles/Unicode.html
About this document...
This chapter is a draft from Introduction to Natural Language Processing [1], by Steven Bird, Ewan Klein and Edward Loper, Copyright © 2007 the authors. It is distributed with the Natural Language Toolkit [2], Version 0.9, under the terms of the Creative Commons Attribution-ShareAlike License [3].
This document is Revision: 5468 Fri Oct 12 16:38:22 EST 2007



