Understanding the Linux Kernel [2nd Edition]


Understanding the Linux Kernel, 2nd Edition
By Daniel P. Bovet, Marco Cesati

Publisher: O'Reilly
Pub Date: December 2002
ISBN: 0-596-00213-0
Pages: 784

The new edition of Understanding the Linux Kernel takes you on a guided tour through the most significant data structures, many algorithms, and programming tricks used in the kernel. The book has been updated to cover version 2.4 of the kernel, which is quite different from version 2.2: the virtual memory system is entirely new, support for multiprocessor systems is improved, and whole new classes of hardware devices have been added. You'll learn what conditions bring out Linux's best performance, and how it meets the challenge of providing good system response during process scheduling, file access, and memory management in a wide variety of environments.


Copyright
Preface
    The Audience for This Book
    Organization of the Material
    Overview of the Book
    Background Information
    Conventions in This Book
    How to Contact Us
    Acknowledgments
Chapter 1. Introduction
    Section 1.1. Linux Versus Other Unix-Like Kernels
    Section 1.2. Hardware Dependency
    Section 1.3. Linux Versions
    Section 1.4. Basic Operating System Concepts
    Section 1.5. An Overview of the Unix Filesystem
    Section 1.6. An Overview of Unix Kernels
Chapter 2. Memory Addressing
    Section 2.1. Memory Addresses
    Section 2.2. Segmentation in Hardware
    Section 2.3. Segmentation in Linux
    Section 2.4. Paging in Hardware
    Section 2.5. Paging in Linux
Chapter 3. Processes
    Section 3.1. Processes, Lightweight Processes, and Threads
    Section 3.2. Process Descriptor
    Section 3.3. Process Switch
    Section 3.4. Creating Processes
    Section 3.5. Destroying Processes
Chapter 4. Interrupts and Exceptions
    Section 4.1. The Role of Interrupt Signals
    Section 4.2. Interrupts and Exceptions
    Section 4.3. Nested Execution of Exception and Interrupt Handlers
    Section 4.4. Initializing the Interrupt Descriptor Table
    Section 4.5. Exception Handling
    Section 4.6. Interrupt Handling
    Section 4.7. Softirqs, Tasklets, and Bottom Halves
    Section 4.8. Returning from Interrupts and Exceptions
Chapter 5. Kernel Synchronization
    Section 5.1. Kernel Control Paths
    Section 5.2. When Synchronization Is Not Necessary
    Section 5.3. Synchronization Primitives
    Section 5.4. Synchronizing Accesses to Kernel Data Structures
    Section 5.5. Examples of Race Condition Prevention
Chapter 6. Timing Measurements
    Section 6.1. Hardware Clocks
    Section 6.2. The Linux Timekeeping Architecture
    Section 6.3. CPU's Time Sharing
    Section 6.4. Updating the Time and Date
    Section 6.5. Updating System Statistics
    Section 6.6. Software Timers
    Section 6.7. System Calls Related to Timing Measurements
Chapter 7. Memory Management
    Section 7.1. Page Frame Management
    Section 7.2. Memory Area Management
    Section 7.3. Noncontiguous Memory Area Management
Chapter 8. Process Address Space
    Section 8.1. The Process's Address Space
    Section 8.2. The Memory Descriptor
    Section 8.3. Memory Regions
    Section 8.4. Page Fault Exception Handler
    Section 8.5. Creating and Deleting a Process Address Space
    Section 8.6. Managing the Heap
Chapter 9. System Calls
    Section 9.1. POSIX APIs and System Calls
    Section 9.2. System Call Handler and Service Routines
    Section 9.3. Kernel Wrapper Routines
Chapter 10. Signals
    Section 10.1. The Role of Signals
    Section 10.2. Generating a Signal
    Section 10.3. Delivering a Signal
    Section 10.4. System Calls Related to Signal Handling
Chapter 11. Process Scheduling
    Section 11.1. Scheduling Policy
    Section 11.2. The Scheduling Algorithm
    Section 11.3. System Calls Related to Scheduling
Chapter 12. The Virtual Filesystem
    Section 12.1. The Role of the Virtual Filesystem (VFS)
    Section 12.2. VFS Data Structures
    Section 12.3. Filesystem Types
    Section 12.4. Filesystem Mounting
    Section 12.5. Pathname Lookup
    Section 12.6. Implementations of VFS System Calls
    Section 12.7. File Locking
Chapter 13. Managing I/O Devices
    Section 13.1. I/O Architecture
    Section 13.2. Device Files
    Section 13.3. Device Drivers
    Section 13.4. Block Device Drivers
    Section 13.5. Character Device Drivers
Chapter 14. Disk Caches
    Section 14.1. The Page Cache
    Section 14.2. The Buffer Cache
Chapter 15. Accessing Files
    Section 15.1. Reading and Writing a File
    Section 15.2. Memory Mapping
    Section 15.3. Direct I/O Transfers
Chapter 16. Swapping: Methods for Freeing Memory
    Section 16.1. What Is Swapping?
    Section 16.2. Swap Area
    Section 16.3. The Swap Cache
    Section 16.4. Transferring Swap Pages
    Section 16.5. Swapping Out Pages
    Section 16.6. Swapping in Pages
    Section 16.7. Reclaiming Page Frame
Chapter 17. The Ext2 and Ext3 Filesystems
    Section 17.1. General Characteristics of Ext2
    Section 17.2. Ext2 Disk Data Structures
    Section 17.3. Ext2 Memory Data Structures
    Section 17.4. Creating the Ext2 Filesystem
    Section 17.5. Ext2 Methods
    Section 17.6. Managing Ext2 Disk Space
    Section 17.7. The Ext3 Filesystem
Chapter 18. Networking
    Section 18.1. Main Networking Data Structures
    Section 18.2. System Calls Related to Networking
    Section 18.3. Sending Packets to the Network Card
    Section 18.4. Receiving Packets from the Network Card
Chapter 19. Process Communication
    Section 19.1. Pipes
    Section 19.2. FIFOs
    Section 19.3. System V IPC
Chapter 20. Program Execution
    Section 20.1. Executable Files
    Section 20.2. Executable Formats
    Section 20.3. Execution Domains
    Section 20.4. The exec Functions
Appendix A. System Startup
    Section A.1. Prehistoric Age: The BIOS
    Section A.2. Ancient Age: The Boot Loader
    Section A.3. Middle Ages: The setup() Function
    Section A.4. Renaissance: The startup_32() Functions
    Section A.5. Modern Age: The start_kernel() Function
Appendix B. Modules
    Section B.1. To Be (a Module) or Not to Be?
    Section B.2. Module Implementation
    Section B.3. Linking and Unlinking Modules
    Section B.4. Linking Modules on Demand
Appendix C. Source Code Structure
Bibliography
    Books on Unix Kernels
    Books on the Linux Kernel
    Books on PC Architecture and Technical Manuals on Intel Microprocessors
    Other Online Documentation Sources
Colophon
Index


Copyright

Copyright © 2003 O'Reilly & Associates, Inc.

Printed in the United States of America.

Published by O'Reilly & Associates, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O'Reilly & Associates books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safari.oreilly.com). For more information, contact our corporate/institutional sales department: (800) 998-9938 or corporate@oreilly.com.

Nutshell Handbook, the Nutshell Handbook logo, and the O'Reilly logo are registered trademarks of O'Reilly & Associates, Inc. Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O'Reilly & Associates, Inc. was aware of a trademark claim, the designations have been printed in caps or initial caps. The association between the images of the American West and the topic of Linux is a trademark of O'Reilly & Associates, Inc.

While every precaution has been taken in the preparation of this book, the publisher and authors assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.


Preface

In the spring semester of 1997, we taught a course on operating systems based on Linux 2.0. The idea was to encourage students to read the source code. To achieve this, we assigned term projects consisting of making changes to the kernel and performing tests on the modified version. We also wrote course notes for our students about a few critical features of Linux such as task switching and task scheduling.

Out of this work, and with a lot of support from our O'Reilly editor Andy Oram, came the first edition of Understanding the Linux Kernel at the end of 2000, which covered Linux 2.2 with a few previews of Linux 2.4. The success encountered by this book encouraged us to continue along this line, and in the fall of 2001 we started planning a second edition covering Linux 2.4. However, Linux 2.4 is quite different from Linux 2.2. Just to mention a few examples, the virtual memory system is entirely new, support for multiprocessor systems is much better, and whole new classes of hardware devices have been added. As a result, we had to rewrite two-thirds of the book from scratch, increasing its size by roughly 25 percent.

As in our first experience, we read thousands of lines of code, trying to make sense of them. After all this work, we can say that it was worth the effort. We learned a lot of things you don't find in books, and we hope we have succeeded in conveying some of this information in the following pages.


The Audience for This Book

All people curious about how Linux works and why it is so efficient will find answers here. After reading the book, you will find your way through the many thousands of lines of code, distinguishing between crucial data structures and secondary ones; in short, becoming a true Linux hacker.

Our work might be considered a guided tour of the Linux kernel: most of the significant data structures and many algorithms and programming tricks used in the kernel are discussed. In many cases, the relevant fragments of code are discussed line by line. Of course, you should have the Linux source code on hand and should be willing to spend some effort deciphering some of the functions that are not, for the sake of brevity, fully described.

On another level, the book provides valuable insight to people who want to know more about the critical design issues in a modern operating system. It is not specifically addressed to system administrators or programmers; it is mostly for people who want to understand how things really work inside the machine! As with any good guide, we try to go beyond superficial features. We offer a background, such as the history of major features and the reasons why they were used.


Organization of the Material

When we began to write this book, we were faced with a critical decision: should we refer to a specific hardware platform or skip the hardware-dependent details and concentrate on the pure hardware-independent parts of the kernel? Other books on Linux kernel internals have chosen the latter approach; we decided to adopt the former one for the following reasons:

● Efficient kernels take advantage of most available hardware features, such as addressing techniques, caches, processor exceptions, special instructions, processor control registers, and so on. If we want to convince you that the kernel indeed does quite a good job in performing a specific task, we must first tell what kind of support comes from the hardware.
● Even if a large portion of a Unix kernel source code is processor-independent and coded in C, a small and critical part is coded in assembly language. A thorough knowledge of the kernel therefore requires the study of a few assembly language fragments that interact with the hardware.

When covering hardware features, our strategy is quite simple: just sketch the features that are totally hardware-driven while detailing those that need some software support. In fact, we are interested in kernel design rather than in computer architecture.

Our next step in choosing our path consisted of selecting the computer system to describe. Although Linux is now running on several kinds of personal computers and workstations, we decided to concentrate on the very popular and cheap IBM-compatible personal computers, and thus on the 80x86 microprocessors and on some support chips included in these personal computers. The term 80x86 microprocessor will be used in the forthcoming chapters to denote the Intel 80386, 80486, Pentium, Pentium Pro, Pentium II, Pentium III, and Pentium 4 microprocessors or compatible models. In a few cases, explicit references will be made to specific models.

One more choice we had to make was the order to follow in studying Linux components. We tried a bottom-up approach: start with topics that are hardware-dependent and end with those that are totally hardware-independent. In fact, we'll make many references to the 80x86 microprocessors in the first part of the book, while the rest of it is relatively hardware-independent. One significant exception is made in Chapter 13. In practice, following a bottom-up approach is not as simple as it looks, since the areas of memory management, process management, and filesystems are intertwined; a few forward references (that is, references to topics yet to be explained) are unavoidable.

Each chapter starts with a theoretical overview of the topics covered. The material is then presented according to the bottom-up approach. We start with the data structures needed to support the functionalities described in the chapter. Then we usually move from the lowest level of functions to higher levels, often ending by showing how system calls issued by user applications are supported.

Level of Description

Linux source code for all supported architectures is contained in more than 8,000 C and assembly language files stored in about 530 subdirectories; it consists of roughly 4 million lines of code, which occupy over 144 megabytes of disk space. Of course, this book can cover only a very small portion of that code. To get a sense of how big the Linux source is, consider that the whole source code of the book you are reading occupies less than 3 megabytes of disk space. Therefore, we would need more than 40 books like this to list all code, without even commenting on it!

So we had to make some choices about the parts to describe. This is a rough assessment of our decisions:

● We describe process and memory management fairly thoroughly.
● We cover the Virtual Filesystem and the Ext2 and Ext3 filesystems, although many functions are just mentioned without detailing the code; we do not discuss other filesystems supported by Linux.
● We describe device drivers, which account for a good part of the kernel, as far as the kernel interface is concerned, but do not attempt an analysis of each specific driver, including the terminal drivers.
● We cover the inner layers of networking in a rather sketchy way, since this area deserves a whole new book by itself.

The book describes the official 2.4.18 version of the Linux kernel, which can be downloaded from the web site, http://www.kernel.org.

Be aware that most distributions of GNU/Linux modify the official kernel to implement new features or to improve its efficiency. In a few cases, the source code provided by your favorite distribution might differ significantly from the one described in this book.

In many cases, the code fragments shown in this book are the original code rewritten in an easier-to-read but less efficient way. This occurs at time-critical points at which sections of programs are often written in a mixture of hand-optimized C and assembly code. Once again, our aim is to provide some help in studying the original Linux code.

While discussing kernel code, we often end up describing the underpinnings of many familiar features that Unix programmers have heard of and about which they may be curious (shared and mapped memory, signals, pipes, symbolic links, etc.).
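The size figures quoted in this section (number of files, subdirectories, lines, and megabytes) can be checked against any unpacked kernel tree with a short script. What follows is only a minimal sketch under stated assumptions, not the method the authors used: the function names `measure_tree` and `books_needed` are made up for illustration, and the choice of the `.c`, `.h`, and `.S` suffixes as "C and assembly language files" is a guess, since the text does not say which files were counted.

```python
import os

# Assumption: these suffixes approximate "C and assembly language files";
# the book does not state which extensions were included in its counts.
SOURCE_SUFFIXES = (".c", ".h", ".S")


def measure_tree(root, suffixes=SOURCE_SUFFIXES):
    """Walk `root` and total the source files, subdirectories, lines, and bytes."""
    totals = {"files": 0, "subdirs": 0, "lines": 0, "bytes": 0}
    for dirpath, dirnames, filenames in os.walk(root):
        # Every directory below `root` shows up exactly once in some `dirnames`.
        totals["subdirs"] += len(dirnames)
        for name in filenames:
            if not name.endswith(suffixes):
                continue
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # skip symbolic links so nothing is counted twice
            with open(path, "rb") as handle:
                data = handle.read()
            totals["files"] += 1
            totals["bytes"] += len(data)
            totals["lines"] += data.count(b"\n")
    return totals


def books_needed(tree_megabytes, book_megabytes):
    """Ratio from the text: how many book-sized volumes the tree would fill."""
    return tree_megabytes / book_megabytes
```

Calling `measure_tree` on the directory where the 2.4.18 sources were unpacked should give numbers of the same order as those quoted above, although the exact totals depend on which suffixes are counted. With the figures from the text, `books_needed(144, 3)` evaluates to 48, which is consistent with the "more than 40 books" estimate.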


Overview of the Book

To make life easier, Chapter 1 presents a general picture of what is inside a Unix kernel and how Linux competes against other well-known Unix systems.

The heart of any Unix kernel is memory management. Chapter 2 explains how 80x86 processors include special circuits to address data in memory and how Linux exploits them.

Processes are a fundamental abstraction offered by Linux and are introduced in Chapter 3. Here we also explain how each process runs either in an unprivileged User Mode or in a privileged Kernel Mode. Transitions between User Mode and Kernel Mode happen only through well-established hardware mechanisms called interrupts and exceptions. These are introduced in Chapter 4.

On many occasions, the kernel has to deal with bursts of interrupts coming from different devices. Synchronization mechanisms are needed so that all these requests can be serviced in an interleaved way by the kernel: they are discussed in Chapter 5 for both uniprocessor and multiprocessor systems.

One type of interrupt is crucial for allowing Linux to take care of elapsed time; further details can be found in Chapter 6.

Next we focus again on memory: Chapter 7 describes the sophisticated techniques required to handle the most precious resource in the system (besides the processors, of course), available memory. This resource must be granted both to the Linux kernel and to the user applications. Chapter 8 shows how the kernel copes with the requests for memory issued by greedy application programs.

Chapter 9 explains how a process running in User Mode makes requests to the kernel, while Chapter 10 describes how a process may send synchronization signals to other processes. Chapter 11 explains how Linux executes, in turn, every active process in the system so that all of them can progress toward completion.

Now we are ready to move on to another essential topic, how Linux implements the filesystem. A series of chapters covers this topic. Chapter 12 introduces a general layer that supports many different filesystems. Some Linux files are special because they provide trapdoors to reach hardware devices; Chapter 13 offers insights on these special files and on the corresponding hardware device drivers.

Another issue to consider is disk access time; Chapter 14 shows how a clever use of RAM reduces disk accesses, therefore improving system performance significantly. Building on the material covered in these last chapters, we can now explain in Chapter 15 how user applications access normal files. Chapter 16 completes our discussion of Linux memory management and explains the techniques used by Linux to ensure that enough memory is always available. The last chapter dealing with files is Chapter 17, which illustrates the most frequently used Linux filesystem, namely Ext2 and its recent evolution, Ext3.

Chapter 18 deals with the lower layers of networking.

The last two chapters end our detailed tour of the Linux kernel: Chapter 19 introduces communication mechanisms other than signals available to User Mode processes; Chapter 20 explains how user applications are started.

Last, but not least, are the appendixes: Appendix A sketches out how Linux is booted, while Appendix B describes how to dynamically reconfigure the running kernel, adding and removing functionalities as needed. Appendix C is just a list of the directories that contain the Linux source code.


Background Information

No prerequisites are required, except some skill in the C programming language and perhaps some knowledge of assembly language.


Conventions in This Book

The following is a list of typographical conventions used in this book:

Constant Width
    Is used to show the contents of code files or the output from commands, and to indicate source code keywords that appear in code.

Italic
    Is used for file and directory names, program and command names, command-line options, URLs, and for emphasizing new terms.


How to Contact Us

Please address comments and questions concerning this book to the publisher:

O'Reilly & Associates, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
(800) 998-9938 (in the United States or Canada)
(707) 829-0515 (international or local)
(707) 829-0104 (fax)

We have a web page for this book, where we list errata, examples, or any additional information. You can access this page at:

http://www.oreilly.com/catalog/linuxkernel2/

To comment or ask technical questions about this book, send email to:

bookquestions@oreilly.com

For more information about our books, conferences, Resource Centers, and the O'Reilly Network, see our web site at:

http://www.oreilly.com


Acknowledgments

This book would not have been written without the precious help of the many students of the University of Rome school of engineering "Tor Vergata" who took our course and tried to decipher lecture notes about the Linux kernel. Their strenuous efforts to grasp the meaning of the source code led us to improve our presentation and correct many mistakes.

Andy Oram, our wonderful editor at O'Reilly & Associates, deserves a lot of credit. He was the first at O'Reilly to believe in this project, and he spent a lot of time and energy deciphering our preliminary drafts. He also suggested many ways to make the book more readable, and he wrote several excellent introductory paragraphs.

Many thanks also to the O'Reilly staff, especially Rob Romano, the technical illustrator, and Lenny Muellner, for tools support.

We had some prestigious reviewers who read our text quite carefully. The first edition was checked by (in alphabetical order by first name) Alan Cox, Michael Kerrisk, Paul Kinzelman, Raph Levien, and Rik van Riel.

Erez Zadok, Jerry Cooperstein, John Goerzen, Michael Kerrisk, Paul Kinzelman, Rik van Riel, and Walt Smith reviewed this second edition. Their comments, together with those of many readers from all over the world, helped us to remove several errors and inaccuracies and have made this book stronger.

Daniel P. Bovet
Marco Cesati
September 2002


Chapter 1. Introduction

Linux is a member of the large family of Unix-like operating systems. A relative newcomer experiencing sudden spectacular popularity starting in the late 1990s, Linux joins such well-known commercial Unix operating systems as System V Release 4 (SVR4), developed by AT&T (now owned by the SCO Group); the 4.4 BSD release from the University of California at Berkeley (4.4BSD); Digital Unix from Digital Equipment Corporation (now Hewlett-Packard); AIX from IBM; HP-UX from Hewlett-Packard; Solaris from Sun Microsystems; and Mac OS X from Apple Computer, Inc.

Linux was initially developed by Linus Torvalds in 1991 as an operating system for IBM-compatible personal computers based on the Intel 80386 microprocessor. Linus remains deeply involved with improving Linux, keeping it up to date with various hardware developments and coordinating the activity of hundreds of Linux developers around the world. Over the years, developers have worked to make Linux available on other architectures, including Hewlett-Packard's Alpha, Itanium (Intel's recent 64-bit processor), MIPS, SPARC, Motorola MC680x0, PowerPC, and IBM's zSeries.
One of the more appealing benefits to Linux is that it isn't a commercial operating system: its source code under the GNU General Public License [1] is open and available to anyone to study (as we will in this book); if you download the code (the official site is http://www.kernel.org) or check the sources on a Linux CD, you will be able to explore, from top to bottom, one of the most successful, modern operating systems. This book, in fact, assumes you have the source code on hand and can apply what we say to your own explorations.

[1] The GNU project is coordinated by the Free Software Foundation, Inc. (http://www.gnu.org); its aim is to implement a whole operating system freely usable by everyone. The availability of a GNU C compiler has been essential for the success of the Linux project.

Technically speaking, Linux is a true Unix kernel, although it is not a full Unix operating system because it does not include all the Unix applications, such as filesystem utilities, windowing systems and graphical desktops, system administrator commands, text editors, compilers, and so on. However, since most of these programs are freely available under the GNU General Public License, they can be installed onto one of the filesystems supported by Linux.

Since the Linux kernel requires so much additional software to provide a useful environment, many Linux users prefer to rely on commercial distributions, available on CD-ROM, to get the code included in a standard Unix system. Alternatively, the code may be obtained from several different FTP sites. The Linux source code is usually installed in the /usr/src/linux directory. In the rest of this book, all file pathnames will refer implicitly to that directory.


1.1 Linux Versus Other Unix-Like Kernels

The various Unix-like systems on the market, some of which have a long history and show signs of archaic practices, differ in many important respects. All commercial variants were derived from either SVR4 or 4.4BSD, and all tend to agree on some common standards like IEEE's Portable Operating System Interface (POSIX) and X/Open's Common Applications Environment (CAE).

The current standards specify only an application programming interface (API), that is, a well-defined environment in which user programs should run. Therefore, the standards do not impose any restriction on internal design choices of a compliant kernel. [2]

[2] As a matter of fact, several non-Unix operating systems, such as Windows NT, are POSIX-compliant.

To define a common user interface, Unix-like kernels often share fundamental design ideas and features. In this respect, Linux is comparable with the other Unix-like operating systems. Reading this book and studying the Linux kernel, therefore, may help you understand the other Unix variants too.

The 2.4 version of the Linux kernel aims to be compliant with the IEEE POSIX standard. This, of course, means that most existing Unix programs can be compiled and executed on a Linux system with very little effort or even without the need for patches to the source code. Moreover, Linux includes all the features of a modern Unix operating system, such as virtual memory, a virtual filesystem, lightweight processes, reliable signals, SVR4 interprocess communications, support for Symmetric Multiprocessor (SMP) systems, and so on.

By itself, the Linux kernel is not very innovative. When Linus Torvalds wrote the first kernel, he referred to some classical books on Unix internals, like Maurice Bach's The Design of the Unix Operating System (Prentice Hall, 1986). Actually, Linux still has some bias toward the Unix baseline described in Bach's book (i.e., SVR4). However, Linux doesn't stick to any particular variant. Instead, it tries to adopt the best features and design choices of several different Unix kernels.
The following list describes how Linux competes against some well-known commercial Unix kernels:

Monolithic kernel
It is a large, complex do-it-yourself program, composed of several logically different components. In this, it is quite conventional; most commercial Unix variants are monolithic. (A notable exception is Carnegie-Mellon's Mach 3.0, which follows a microkernel approach.)

Compiled and statically linked traditional Unix kernels
Most modern kernels can dynamically load and unload some portions of the kernel code (typically, device drivers), which are usually called modules. Linux's support for modules is very good, since it is able to automatically load and unload modules on demand. Among the main commercial Unix variants, only the SVR4.2 and Solaris kernels have a similar feature.

Kernel threading
Some modern Unix kernels, such as Solaris 2.x and SVR4.2/MP, are organized as a set of kernel threads. A kernel thread is an execution context that can be independently scheduled; it may be associated with a user program, or it may run only some kernel functions. Context switches between kernel threads are usually much less expensive than context switches between ordinary processes, since the former usually operate on a common address space. Linux uses kernel threads in a very limited way to execute a few kernel functions periodically; since Linux kernel threads cannot execute user programs, they do not represent the basic execution context abstraction. (That's the topic of the next item.)

Multithreaded application support
Most modern operating systems have some kind of support for multithreaded applications, that is, user programs that are well designed in terms of many relatively independent execution flows that share a large portion of the application data structures. A multithreaded user application could be composed of many lightweight processes (LWP), which are processes that can operate on a common address space, common physical memory pages, common opened files, and so on.
Linux defines its own version of lightweight processes, which is different from the types used on other systems such as SVR4 and Solaris. While all the commercial Unix variants of LWP are based on kernel threads, Linux regards lightweight processes as the basic execution context and handles them via the nonstandard clone( ) system call.

Nonpreemptive kernel
Linux 2.4 cannot arbitrarily interleave execution flows while they are in privileged mode. [3] Several sections of kernel code assume they can run and modify data structures without fear of being interrupted and having another thread alter those data structures. Usually, fully preemptive kernels are associated with special real-time operating systems. Currently, among conventional, general-purpose Unix systems, only Solaris 2.x and Mach 3.0 are fully preemptive kernels. SVR4.2/MP introduces some fixed preemption points as a method to get limited preemption capability.

[3] This restriction has been removed in the Linux 2.5 development version.

Multiprocessor support
Several Unix kernel variants take advantage of multiprocessor systems. Linux 2.4 supports symmetric multiprocessing (SMP): the system can use multiple processors and each processor can handle any task; there is no discrimination among them. Although a few parts of the kernel code are still serialized by means of a single "big kernel lock," it is fair to say that Linux 2.4 makes a near optimal use of SMP.

Filesystem
Linux's standard filesystems come in many flavors. You can use the plain old Ext2 filesystem if you don't have specific needs. You might switch to Ext3 if you want to avoid lengthy filesystem checks after a system crash. If you'll have to deal with many small files, the ReiserFS filesystem is likely to be the best choice. Besides Ext3 and ReiserFS, several other journaling filesystems can be used in Linux, even if they are not included in the vanilla Linux tree; they include IBM AIX's Journaling File System (JFS) and Silicon Graphics Irix's XFS filesystem. Thanks to a powerful object-oriented Virtual File System technology (inspired by Solaris and SVR4), porting a foreign filesystem to Linux is a relatively easy task.

STREAMS
Linux has no analog to the STREAMS I/O subsystem introduced in SVR4, although it is included now in most Unix kernels and has become the preferred interface for writing device drivers, terminal drivers, and network protocols.

This somewhat modest assessment does not depict, however, the whole truth. Several features make Linux a wonderfully unique operating system. Commercial Unix kernels often introduce new features to gain a larger slice of the market, but these features are not necessarily useful, stable, or productive. As a matter of fact, modern Unix kernels tend to be quite bloated. By contrast, Linux doesn't suffer from the restrictions and the conditioning imposed by the market, hence it can freely evolve according to the ideas of its designers (mainly Linus Torvalds). Specifically, Linux offers the following advantages over its commercial competitors:



● Linux is free. You can install a complete Unix system at no expense other than the hardware (of course).

● Linux is fully customizable in all its components. Thanks to the General Public License (GPL), you are allowed to freely read and modify the source code of the kernel and of all system programs. [4]

[4] Several commercial companies have started to support their products under Linux. However, most of them aren't distributed under an open source license, so you might not be allowed to read or modify their source code.











● Linux runs on low-end, cheap hardware platforms. You can even build a network server using an old Intel 80386 system with 4 MB of RAM.

● Linux is powerful. Linux systems are very fast, since they fully exploit the features of the hardware components. The main Linux goal is efficiency, and indeed many design choices of commercial variants, like the STREAMS I/O subsystem, have been rejected by Linus because of their implied performance penalty.

● Linux has a high standard for source code quality. Linux systems are usually very stable; they have a very low failure rate and system maintenance time.

● The Linux kernel can be very small and compact. It is possible to fit both a kernel image and full root filesystem, including all fundamental system programs, on just one 1.4 MB floppy disk. As far as we know, none of the commercial Unix variants is able to boot from a single floppy disk.

● Linux is highly compatible with many common operating systems. It lets you directly mount filesystems for all versions of MS-DOS and MS Windows, SVR4, OS/2, Mac OS, Solaris, SunOS, NeXTSTEP, many BSD variants, and so on. Linux is also able to operate with many network layers, such as Ethernet (as well as Fast Ethernet and Gigabit Ethernet), Fiber Distributed Data Interface (FDDI), High Performance Parallel Interface (HIPPI), IBM's Token Ring, AT&T WaveLAN, and DEC RoamAbout DS. By using suitable libraries, Linux systems are even able to directly run programs written for other operating systems. For example, Linux is able to execute applications written for MS-DOS, MS Windows, SVR3 and R4, 4.4BSD, SCO Unix, XENIX, and others on the 80x86 platform.



● Linux is well supported. Believe it or not, it may be a lot easier to get patches and updates for Linux than for any proprietary operating system. The answer to a problem often comes back within a few hours after sending a message to some newsgroup or mailing list. Moreover, drivers for Linux are usually available a few weeks after new hardware products have been introduced on the market. By contrast, hardware manufacturers release device drivers for only a few commercial operating systems, usually Microsoft's. Therefore, all commercial Unix variants run on a restricted subset of hardware components.

With an estimated installed base of several tens of millions, people who are used to certain features that are standard under other operating systems are starting to expect the same from Linux. In that regard, the demand on Linux developers is also increasing. Luckily, though, Linux has evolved under the close direction of Linus to accommodate the needs of the masses.


1.2 Hardware Dependency

Linux tries to maintain a neat distinction between hardware-dependent and hardware-independent source code. To that end, both the arch and the include directories include 15 subdirectories that correspond to the 15 hardware platforms supported. The standard names of the platforms are:

alpha
Hewlett-Packard's Alpha workstations

arm
ARM processor-based computers and embedded devices

cris
"Code Reduced Instruction Set" CPUs used by Axis in its thin-servers, such as web cameras or development boards

i386
IBM-compatible personal computers based on 80x86 microprocessors

ia64
Workstations based on Intel 64-bit Itanium microprocessor

m68k
Personal computers based on Motorola MC680x0 microprocessors

mips
Workstations based on MIPS microprocessors

mips64
Workstations based on 64-bit MIPS microprocessors

parisc
Workstations based on Hewlett-Packard HP 9000 PA-RISC microprocessors

ppc
Workstations based on Motorola-IBM PowerPC microprocessors

s390
32-bit IBM ESA/390 and zSeries mainframes

s390x
IBM 64-bit zSeries servers

sh
SuperH embedded computers developed jointly by Hitachi and STMicroelectronics

sparc
Workstations based on Sun Microsystems SPARC microprocessors

sparc64
Workstations based on Sun Microsystems 64-bit Ultra SPARC microprocessors


1.3 Linux Versions

Linux distinguishes stable kernels from development kernels through a simple numbering scheme. Each version is characterized by three numbers, separated by periods. The first two numbers are used to identify the version; the third number identifies the release. As shown in Figure 1-1, if the second number is even, it denotes a stable kernel; otherwise, it denotes a development kernel. At the time of this writing, the current stable version of the Linux kernel is 2.4.18, and the current development version is 2.5.22. The 2.4 kernel, which is the basis for this book, was first released in January 2001 and differs considerably from the 2.2 kernel, particularly with respect to memory management. Work on the 2.5 development version started in November 2001.

Figure 1-1. Numbering Linux versions
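The numbering rule described in this section can be sketched in a few lines of C. This is only an illustration, not code taken from the kernel: the function name linux_version_is_stable and its return convention are assumptions made for this example, and the sketch accepts only plain three-number strings such as 2.4.18 (any suffix is treated as malformed).

```c
#include <stdio.h>

/*
 * Hypothetical helper (not part of the kernel sources).
 * Classifies a version string of the form "X.Y.Z" using the rule
 * described in the text: an even second number denotes a stable
 * kernel, an odd one a development kernel.
 *
 * Returns  1 for a stable kernel,
 *          0 for a development kernel,
 *         -1 if the string is not three non-negative numbers
 *            separated by periods.
 */
int linux_version_is_stable(const char *version)
{
    int major, minor, release;
    char trailing;

    /* The trailing %c catches extra characters after the release
     * number; exactly three successful conversions means a clean
     * "X.Y.Z" string. */
    if (sscanf(version, "%d.%d.%d%c",
               &major, &minor, &release, &trailing) != 3)
        return -1;

    if (major < 0 || minor < 0 || release < 0)
        return -1;

    return (minor % 2 == 0) ? 1 : 0;
}
```

Under these assumptions, the two versions mentioned in the text classify as expected: "2.4.18" is reported as stable and "2.5.22" as a development kernel.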

New releases of a stable version come out mostly to fix bugs reported by users. The main algorithms and data structures used to implement the kernel are left unchanged. [5]

[5] The practice does not always follow the theory. For instance, the virtual memory system has been significantly changed, starting with the 2.4.10 release.

Development versions, on the other hand, may differ quite significantly from one another; kernel developers are free to experiment with different solutions that occasionally lead to drastic kernel changes. Users who rely on development versions for running applications may experience unpleasant surprises when upgrading their kernel to a newer release. This book concentrates on the most recent stable kernel that we had available because, among all the new features being tried in experimental kernels, there's no way of telling which will ultimately be accepted and what they'll look like in their final form.


1.4 Basic Operating System Concepts

Each computer system includes a basic set of programs called the operating system. The most important program in the set is called the kernel. It is loaded into RAM when the system boots and contains many critical procedures that are needed for the system to operate. The other programs are less crucial utilities; they can provide a wide variety of interactive experiences for the user (as well as doing all the jobs the user bought the computer for), but the essential shape and capabilities of the system are determined by the kernel. The kernel provides key facilities to everything else on the system and determines many of the characteristics of higher software. Hence, we often use the term "operating system" as a synonym for "kernel."

The operating system must fulfill two main objectives:

● Interact with the hardware components, servicing all low-level programmable elements included in the hardware platform.

● Provide an execution environment to the applications that run on the computer system (the so-called user programs).

Some operating systems allow all user programs to directly play with the hardware components (a typical example is MS-DOS). In contrast, a Unix-like operating system hides all low-level details concerning the physical organization of the computer from applications run by the user. When a program wants to use a hardware resource, it must issue a request to the operating system. The kernel evaluates the request and, if it chooses to grant the resource, interacts with the relevant hardware components on behalf of the user program.

To enforce this mechanism, modern operating systems rely on the availability of specific hardware features that forbid user programs to directly interact with low-level hardware components or to access arbitrary memory locations. In particular, the hardware introduces at least two different execution modes for the CPU: a nonprivileged mode for user programs and a privileged mode for the kernel. Unix calls these User Mode and Kernel Mode, respectively.

In the rest of this chapter, we introduce the basic concepts that have motivated the design of Unix over the past two decades, as well as Linux and other operating systems. While the concepts are probably familiar to you as a Linux user, these sections try to delve into them a bit more deeply than usual to explain the requirements they place on an operating system kernel.
These broad considerations refer to virtually all Unix-like systems. The other chapters of this book will hopefully help you understand the Linux kernel internals.
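As a small illustration of a user program issuing requests to the operating system instead of touching the hardware itself, the sketch below writes a string to a file using only the standard POSIX calls open( ), write( ), and close( ). The helper name write_through_kernel is hypothetical and chosen for this example. Each of the three calls hands control to the kernel, which decides whether to grant the request and performs the device access on the program's behalf.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper for illustration only.
 * Writes `text` to the file at `path` by asking the kernel to do it.
 * Returns the number of bytes written, or -1 if the kernel refuses
 * the request (for example, because the file does not exist or the
 * caller lacks write permission).
 */
ssize_t write_through_kernel(const char *path, const char *text)
{
    /* Request 1: ask the kernel for access to the file. */
    int fd = open(path, O_WRONLY);
    if (fd == -1)
        return -1;          /* the kernel denied the request */

    /* Request 2: ask the kernel to transfer the data. The program
     * never talks to the disk or terminal hardware directly. */
    ssize_t written = write(fd, text, strlen(text));

    /* Request 3: tell the kernel we are done with the file. */
    close(fd);
    return written;
}
```

For instance, calling write_through_kernel("/dev/null", "hello") on a typical Linux system should report five bytes written, while a path that does not exist yields -1 because the kernel rejects the open( ) request.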

1.4.1 Multiuser Systems

A multiuser system is a computer that is able to concurrently and independently execute several applications belonging to two or more users. Concurrently means that applications can be active at the same time and contend for the various resources such as CPU, memory, hard disks, and so on. Independently means that each application can perform its task with no concern for what the applications of the other users are doing. Switching from one application to another, of course, slows down each of them and affects the response time seen by the users. Many of the complexities of modern operating system kernels, which we will examine in this book, are present to minimize the delays enforced on each program and to provide the user with responses that are as fast as possible.

Multiuser operating systems must include several features:

● An authentication mechanism for verifying the user's identity

● A protection mechanism against buggy user programs that could block other applications running in the system

● A protection mechanism against malicious user programs that could interfere with or spy on the activity of other users

● An accounting mechanism that limits the amount of resource units assigned to each user

To ensure safe protection mechanisms, operating systems must use the hardware protection associated with the CPU privileged mode. Otherwise, a user program would be able to directly access the system circuitry and overcome the imposed bounds. Unix is a multiuser system that enforces the hardware protection of system resources.

1.4.2 Users and Groups

In a multiuser system, each user has a private space on the machine; typically, he owns some quota of the disk space to store files, receives private mail messages, and so on. The operating system must ensure that the private portion of a user space is visible only to its owner. In particular, it must ensure that no user can exploit a system application for the purpose of violating the private space of another user.

All users are identified by a unique number called the User ID, or UID. Usually only a restricted number of persons are allowed to make use of a computer system. When one of these users starts a working session, the operating system asks for a login name and a password. If the user does not input a valid pair, the system denies access. Since the password is assumed to be secret, the user's privacy is ensured.

To selectively share material with other users, each user is a member of one or more groups, which are identified by a unique number called a Group ID, or GID. Each file is associated with exactly one group. For example, access can be set so the user owning the file has read and write privileges, the group has read-only privileges, and other users on the system are denied access to the file.

Any Unix-like operating system has a special user called root, superuser, or supervisor. The system administrator must log in as root to handle user accounts, perform maintenance tasks such as system backups and program upgrades, and so on. The root user can do almost everything, since the operating system does not apply the usual protection mechanisms to her. In particular, the root user can access every file on the system and can interfere with the activity of every running user program.
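The owner, group, and other access rules described in this section can be sketched as a small C function. This is a simplified model written for illustration, not the kernel's actual permission code: the name may_access and the WANT_* constants are assumptions of this example, only one group per user is considered, and the root user is simply granted everything.

```c
#include <sys/types.h>

/* Hypothetical request bits, matching the classic rwx layout. */
#define WANT_READ  4
#define WANT_WRITE 2
#define WANT_EXEC  1

/*
 * Simplified model of the classic Unix permission check.
 * `mode` holds the nine permission bits (for example, 0640),
 * `file_uid` and `file_gid` identify the file's owner and group,
 * `uid` and `gid` identify the user making the request, and `want`
 * is a combination of the WANT_* bits.
 * Returns 1 if access is allowed, 0 otherwise.
 */
int may_access(mode_t mode, uid_t file_uid, gid_t file_gid,
               uid_t uid, gid_t gid, int want)
{
    unsigned int granted;

    /* Simplification: the usual protection mechanisms are not
     * applied to root (UID 0). */
    if (uid == 0)
        return 1;

    if (uid == file_uid)
        granted = (mode >> 6) & 7;   /* owner bits */
    else if (gid == file_gid)
        granted = (mode >> 3) & 7;   /* group bits */
    else
        granted = mode & 7;          /* bits for all other users */

    /* Every requested bit must be present in the granted set. */
    return (granted & (unsigned int)want) == (unsigned int)want;
}
```

With a mode of 0640, this model reproduces the example from the text: the owner may read and write, a member of the file's group may only read, and any other user is denied access.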

1.4.3 Processes

All operating systems use one fundamental abstraction: the process. A process can be defined either as "an instance of a program in execution" or as the "execution context" of a running program. In traditional operating systems, a process executes a single sequence of instructions in an address space; the address space is the set of memory addresses that the process is allowed to reference. Modern operating systems allow processes with multiple execution flows, that is, multiple sequences of instructions executed in the same address space.

Multiuser systems must enforce an execution environment in which several processes can be active concurrently and contend for system resources, mainly the CPU. Systems that allow concurrent active processes are said to be multiprogramming or multiprocessing. [6] It is important to distinguish programs from processes; several processes can execute the same program concurrently, while the same process can execute several programs sequentially.

[6] Some multiprocessing operating systems are not multiuser; an example is Microsoft's Windows 98.

On uniprocessor systems, just one process can hold the CPU, and hence just one execution flow can progress at a time. In general, the number of CPUs is always restricted, and therefore only a few processes can progress at once. An operating system component called the scheduler chooses the process that can progress. Some operating systems allow only nonpreemptive processes, which means that the scheduler is invoked only when a process voluntarily relinquishes the CPU. But processes of a multiuser system must be preemptive; the operating system tracks how long each process holds the CPU and periodically activates the scheduler.

Unix is a multiprocessing operating system with preemptive processes. Even when no user is logged in and no application is running, several system processes monitor the peripheral devices. In particular, several processes listen at the system terminals waiting for user logins. When a user inputs a login name, the listening process runs a program that validates the user password. If the user identity is acknowledged, the process creates another process that runs a shell into which commands are entered. When a graphical display is activated, one process runs the window manager, and each window on the display is usually run by a separate process. When a user creates a graphics shell, one process runs the graphics windows and a second process runs the shell into which the user can enter the commands. For each user command, the shell process creates another process that executes the corresponding program.

Unix-like operating systems adopt a process/kernel model. Each process has the illusion that it's the only process on the machine and it has exclusive access to the operating system services. Whenever a process makes a system call (i.e., a request to the kernel), the hardware changes the privilege mode from User Mode to Kernel Mode, and the process starts the execution of a kernel procedure with a strictly limited purpose. In this way, the operating system acts within the execution context of the process in order to satisfy its request. Whenever the request is fully satisfied, the kernel procedure forces the hardware to return to User Mode and the process continues its execution from the instruction following the system call.
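The remark that a shell creates another process for each user command can be sketched in a few lines. The example assumes the standard POSIX calls fork(), execvp(), and waitpid(), none of which is named in the passage; run_command() is a hypothetical helper, and a real shell does considerably more (job control, redirection, signal handling).

```c
#define _XOPEN_SOURCE 700
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Runs the program named by argv[0] in a new process, the way a shell
 * handles one command. Returns the child's exit status, or -1 if the
 * process could not be created or did not exit normally. */
int run_command(char *const argv[])
{
    int status;
    pid_t pid = fork();            /* create another process */

    if (pid < 0)
        return -1;
    if (pid == 0) {                /* child: execute the requested program */
        execvp(argv[0], argv);
        _exit(127);                /* reached only if execvp() failed */
    }
    if (waitpid(pid, &status, 0) < 0)   /* parent: wait for the child */
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Each fork(), execvp(), and waitpid() is itself a system call, so every step here crosses from User Mode into Kernel Mode and back, as the last paragraph describes.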

1.4.4 Kernel Architecture

As stated before, most Unix kernels are monolithic: each kernel layer is integrated into the whole kernel program and runs in Kernel Mode on behalf of the current process. In contrast, microkernel operating systems demand a very small set of functions from the kernel, generally including a few synchronization primitives, a simple scheduler, and an interprocess communication mechanism. Several system processes that run on top of the microkernel implement other operating system-layer functions, like memory allocators, device drivers, and system call handlers.

Although academic research on operating systems is oriented toward microkernels, such operating systems are generally slower than monolithic ones, since the explicit message passing between the different layers of the operating system has a cost. However, microkernel operating systems might have some theoretical advantages over monolithic ones. Microkernels force the system programmers to adopt a modularized approach, since each operating system layer is a relatively independent program that must interact with the other layers through well-defined and clean software interfaces. Moreover, an existing microkernel operating system can be ported to other architectures fairly easily, since all hardware-dependent components are generally encapsulated in the microkernel code. Finally, microkernel operating systems tend to make better use of random access memory (RAM) than monolithic ones, since system processes that aren't implementing needed functionalities might be swapped out or destroyed.

To achieve many of the theoretical advantages of microkernels without introducing performance penalties, the Linux kernel offers modules. A module is an object file whose code can be linked to (and unlinked from) the kernel at runtime. The object code usually consists of a set of functions that implements a filesystem, a device driver, or other features at the kernel's upper layer. The module, unlike the external layers of microkernel operating systems, does not run as a specific process. Instead, it is executed in Kernel Mode on behalf of the current process, like any other statically linked kernel function.

The main advantages of using modules include:

A modularized approach
Since any module can be linked and unlinked at runtime, system programmers must introduce well-defined software interfaces to access the data structures handled by modules. This makes it easy to develop new modules.

Platform independence
Even if it may rely on some specific hardware features, a module doesn't depend on a fixed hardware platform. For example, a disk driver module that relies on the SCSI standard works as well on an IBM-compatible PC as it does on Hewlett-Packard's Alpha.

Frugal main memory usage
A module can be linked to the running kernel when its functionality is required and unlinked when it is no longer useful. This mechanism also can be made transparent to the user, since linking and unlinking can be performed automatically by the kernel.

No performance penalty
Once linked in, the object code of a module is equivalent to the object code of the statically linked kernel. Therefore, no explicit message passing is required when the functions of the module are invoked. [7]

[7] A small performance penalty occurs when the module is linked and unlinked. However, this penalty can be compared to the penalty caused by the creation and deletion of system processes in microkernel operating systems.


1.5 An Overview of the Unix Filesystem

The Unix operating system design is centered on its filesystem, which has several interesting characteristics. We'll review the most significant ones, since they will be mentioned quite often in forthcoming chapters.

1.5.1 Files

A Unix file is an information container structured as a sequence of bytes; the kernel does not interpret the contents of a file. Many programming libraries implement higher-level abstractions, such as records structured into fields and record addressing based on keys. However, the programs in these libraries must rely on system calls offered by the kernel. From the user's point of view, files are organized in a tree-structured namespace, as shown in Figure 1-2.

Figure 1-2. An example of a directory tree

All the nodes of the tree, except the leaves, denote directory names. A directory node contains information about the files and directories just beneath it. A file or directory name consists of a sequence of arbitrary ASCII characters, [8] with the exception of / and of the null character \0. Most filesystems place a limit on the length of a filename, typically no more than 255 characters. The directory corresponding to the root of the tree is called the root directory. By convention, its name is a slash (/). Names must be different within the same directory, but the same name may be used in different directories.

[8] Some operating systems allow filenames to be expressed in many different alphabets, based on 16-bit extended coding of graphical characters such as Unicode.

Unix associates a current working directory with each process (see Section 1.6.1 later in this chapter); it belongs to the process execution context, and it identifies the directory currently used by the process. To identify a specific file, the process uses a pathname, which consists of slashes alternating with a sequence of directory names that lead to the file. If the first item in the pathname is a slash, the pathname is said to be absolute, since its starting point is the root directory. Otherwise, if the first item is a directory name or filename, the pathname is said to be relative, since its starting point is the process's current directory.

While specifying filenames, the notations "." and ".." are also used. They denote the current working directory and its parent directory, respectively. If the current working directory is the root directory, "." and ".." coincide.
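The distinction between absolute and relative pathnames, and the claim that "." and ".." coincide in the root directory, can be checked with a short sketch. The helper names is_absolute() and same_file() are hypothetical; same_file() assumes that two pathnames denote the same file when stat() reports the same device and inode numbers for both.

```c
#define _XOPEN_SOURCE 700
#include <sys/stat.h>

/* Returns 1 if the pathname is absolute, i.e., its first item is a slash. */
int is_absolute(const char *path)
{
    return path[0] == '/';
}

/* Returns 1 if both pathnames lead to the same file, judged by comparing
 * the device and inode numbers reported by stat(); returns 0 otherwise,
 * including when either pathname cannot be resolved. */
int same_file(const char *a, const char *b)
{
    struct stat sa, sb;

    if (stat(a, &sa) != 0 || stat(b, &sb) != 0)
        return 0;
    return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}
```

With these helpers, same_file("/.", "/..") is expected to hold, because the parent of the root directory is the root directory itself.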

1.5.2 Hard and Soft Links

A filename included in a directory is called a file hard link, or more simply, a link. The same file may have several links included in the same directory or in different ones, so it may have several filenames.

The Unix command:

$ ln f1 f2

is used to create a new hard link that has the pathname f2 for a file identified by the pathname f1.

Hard links have two limitations:

● Users are not allowed to create hard links for directories. This might transform the directory tree into a graph with cycles, thus making it impossible to locate a file according to its name.
● Links can be created only among files included in the same filesystem. This is a serious limitation, since modern Unix systems may include several filesystems located on different disks and/or partitions, and users may be unaware of the physical divisions between them.

To overcome these limitations, soft links (also called symbolic links) have been introduced. Symbolic links are short files that contain an arbitrary pathname of another file. The pathname may refer to any file located in any filesystem; it may even refer to a nonexistent file.

The Unix command:

$ ln -s f1 f2

creates a new soft link with pathname f2 that refers to pathname f1. When this command is executed, the filesystem extracts the directory part of f2 and creates a new entry in that directory of type symbolic link, with the name indicated by f2. This new file contains the name indicated by pathname f1. This way, each reference to f2 can be translated automatically into a reference to f1.
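The two ln commands correspond to system calls that a program can issue directly. The sketch below assumes the POSIX calls link(), symlink(), stat(), and readlink(), which the section does not name; make_links() is a hypothetical helper written only for illustration.

```c
#define _XOPEN_SOURCE 700
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Creates a hard link and a soft link to `target`, then prints what the
 * filesystem recorded for each. Returns 0 on success, -1 on failure. */
int make_links(const char *target, const char *hard, const char *soft)
{
    struct stat st;
    char buf[256];
    ssize_t n;

    if (link(target, hard) != 0)        /* same effect as: ln target hard */
        return -1;
    if (symlink(target, soft) != 0)     /* same effect as: ln -s target soft */
        return -1;

    if (stat(target, &st) != 0)
        return -1;
    printf("hard links to %s: %ld\n", target, (long)st.st_nlink);

    n = readlink(soft, buf, sizeof(buf) - 1);   /* pathname stored in the link */
    if (n < 0)
        return -1;
    buf[n] = '\0';
    printf("%s -> %s\n", soft, buf);
    return 0;
}
```

After a successful call, the target has one more hard link than before, while the soft link is a separate short file holding only the pathname passed as `target`.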

1.5.3 File Types

Unix files may have one of the following types:

● Regular file
● Directory
● Symbolic link
● Block-oriented device file
● Character-oriented device file
● Pipe and named pipe (also called FIFO)
● Socket

The first three file types are constituents of any Unix filesystem. Their implementation is described in detail in Chapter 17.

Device files are related to I/O devices and device drivers integrated into the kernel. For example, when a program accesses a device file, it acts directly on the I/O device associated with that file (see Chapter 13).

Pipes and sockets are special files used for interprocess communication (see Section 1.6.5 later in this chapter; also see Chapter 18 and Chapter 19).
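A program can ask the kernel which of these types a given pathname has. The following sketch assumes the POSIX lstat() call and the S_ISxxx test macros from <sys/stat.h>; file_type() is a hypothetical helper, and the strings it returns simply mirror the list above.

```c
#define _XOPEN_SOURCE 700
#include <stddef.h>
#include <sys/stat.h>

/* Returns a name for the Unix file type of `path`, or NULL if the file
 * cannot be examined. lstat() is used instead of stat() so that a
 * symbolic link is reported as such rather than being followed. */
const char *file_type(const char *path)
{
    struct stat st;

    if (lstat(path, &st) != 0)
        return NULL;
    if (S_ISREG(st.st_mode))  return "regular file";
    if (S_ISDIR(st.st_mode))  return "directory";
    if (S_ISLNK(st.st_mode))  return "symbolic link";
    if (S_ISBLK(st.st_mode))  return "block-oriented device file";
    if (S_ISCHR(st.st_mode))  return "character-oriented device file";
    if (S_ISFIFO(st.st_mode)) return "FIFO";
    if (S_ISSOCK(st.st_mode)) return "socket";
    return "unknown";
}
```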

1.5.4 File Descriptor and Inode

Unix makes a clear distinction between the contents of a file and the information about a file. With the exception of device and special files, each file consists of a sequence of characters. The file does not include any control information, such as its length or an End-Of-File (EOF) delimiter.

All information needed by the filesystem to handle a file is included in a data structure called an inode. Each file has its own inode, which the filesystem uses to identify the file.

While filesystems and the kernel functions handling them can vary widely from one Unix system to another, they must always provide at least the following attributes, which are specified in the POSIX standard:

● File type (see the previous section)
● Number of hard links associated with the file
● File length in bytes
● Device ID (i.e., an identifier of the device containing the file)
● Inode number that identifies the file within the filesystem
● User ID of the file owner
● Group ID of the file
● Several timestamps that specify the inode status change time, the last access time, and the last modify time
● Access rights and file mode (see the next section)
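From User Mode, these inode attributes are visible through the POSIX stat() call, which fills in a struct stat. The sketch below prints one field per attribute in the list; print_inode() is a hypothetical helper, and the casts are only there because the exact integer types of the fields vary between systems.

```c
#define _XOPEN_SOURCE 700
#include <stdio.h>
#include <sys/stat.h>

/* Prints the inode attributes of `path` as reported by stat().
 * Returns 0 on success, -1 if the file cannot be examined. */
int print_inode(const char *path)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return -1;
    printf("type and mode bits: %lo\n", (unsigned long)st.st_mode);
    printf("hard links:         %lu\n", (unsigned long)st.st_nlink);
    printf("length in bytes:    %lld\n", (long long)st.st_size);
    printf("device ID:          %llu\n", (unsigned long long)st.st_dev);
    printf("inode number:       %llu\n", (unsigned long long)st.st_ino);
    printf("owner UID:          %ld\n", (long)st.st_uid);
    printf("group GID:          %ld\n", (long)st.st_gid);
    printf("status change time: %lld\n", (long long)st.st_ctime);
    printf("last access time:   %lld\n", (long long)st.st_atime);
    printf("last modify time:   %lld\n", (long long)st.st_mtime);
    return 0;
}
```

Note that the file type and the access rights share the single st_mode field, which is printed here in octal.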

1.5.5 Access Rights and File Mode

The potential users of a file fall into three classes:

● The user who is the owner of the file
● The users who belong to the same group as the file, not including the owner
● All remaining users (others)

There are three types of access rights (Read, Write, and Execute) for each of these three classes. Thus, the set of access rights associated with a file consists of nine different binary flags. Three additional flags, called suid (Set User ID), sgid (Set Group ID), and sticky, define the file mode. These flags have the following meanings when applied to executable files:

suid
A process executing a file normally keeps the User ID (UID) of the process owner. However, if the executable file has the suid flag set, the process gets the UID of the file owner.

sgid
A process executing a file keeps the Group ID (GID) of the process group. However, if the executable file has the sgid flag set, the process gets the ID of the file group.

sticky
An executable file with the sticky flag set corresponds to a request to the kernel to keep the program in memory after its execution terminates. [9]

[9] This flag has become obsolete; other approaches based on sharing of code pages are now used (see Chapter 8).

When a file is created by a process, its owner ID is the UID of the process. Its owner group ID can be either the GID of the creator process or the GID of the parent directory, depending on the value of the sgid flag of the parent directory.
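The nine access-right flags and the three file-mode flags are all single bits of one mode word, so they can be decoded with simple masks. The sketch below assumes the POSIX constants from <sys/stat.h> (S_IRUSR through S_IXOTH, plus S_ISUID, S_ISGID, and S_ISVTX for the sticky flag); the function names are hypothetical.

```c
#define _XOPEN_SOURCE 700
#include <sys/stat.h>

/* Writes the nine access-right flags of `mode` into `out` in the order
 * owner, group, others, using 'r', 'w', 'x' for a set flag and '-' for
 * a cleared one. `out` must have room for 10 characters. */
void access_string(mode_t mode, char out[10])
{
    static const mode_t bits[9] = {
        S_IRUSR, S_IWUSR, S_IXUSR,
        S_IRGRP, S_IWGRP, S_IXGRP,
        S_IROTH, S_IWOTH, S_IXOTH
    };
    static const char letters[] = "rwxrwxrwx";
    int i;

    for (i = 0; i < 9; i++)
        out[i] = (mode & bits[i]) ? letters[i] : '-';
    out[9] = '\0';
}

/* Each file-mode flag is tested with its own mask. */
int has_suid(mode_t mode)   { return (mode & S_ISUID) != 0; }
int has_sgid(mode_t mode)   { return (mode & S_ISGID) != 0; }
int has_sticky(mode_t mode) { return (mode & S_ISVTX) != 0; }
```

For instance, a mode in which the owner may read and write, the group may only read, and others have no access would be rendered by access_string() with the owner, group, and others triples in that order.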

1.5.6 File-Handling System Calls

When a user accesses the contents of either a regular file or a directory, he actually accesses some data stored in a hardware block device. In this sense, a filesystem is a user-level view of the physical organization of a hard disk partition. Since a process in User Mode cannot directly interact with the low-level hardware components, each actual file operation must be performed in Kernel Mode. Therefore, the Unix operating system defines several system calls related to file handling.

All Unix kernels devote great attention to the efficient handling of hardware block devices to achieve good overall system performance. In the chapters that follow, we will describe topics related to file handling in Linux and specifically how the kernel reacts to file-related system calls. To understand those descriptions, you will need to know how the main file-handling system calls are used; these are described in the next section.

1.5.6.1 Opening a file

Processes can access only "opened" files. To open a file, the process invokes the system call:

fd = open(path, flag, mode)

The three parameters have the following meanings:

path
Denotes the pathname (relative or absolute) of the file to be opened.

flag
Specifies how the file must be opened (e.g., read, write, read/write, append). It can also specify whether a nonexisting file should be created.

mode
Specifies the access rights of a newly created file.

This system call creates an "open file" object and returns an identifier called a file descriptor. An open file object contains:

● Some file-handling data structures, such as a pointer to the kernel buffer memory area where file data will be copied, an offset field that denotes the current position in the file from which the next operation will take place (the so-called file pointer), and so on.
● Some pointers to kernel functions that the process can invoke. The set of permitted functions depends on the value of the flag parameter.

We discuss open file objects in detail in Chapter 12. Let's limit ourselves here to describing some general properties specified by the POSIX semantics.

● A file descriptor represents an interaction between a process and an opened file, while an open file object contains data related to that interaction. The same open file object may be identified by several file descriptors in the same process.
● Several processes may concurrently open the same file. In this case, the filesystem assigns a separate file descriptor to each file, along with a separate open file object. When this occurs, the Unix filesystem does not provide any kind of synchronization among the I/O operations issued by the processes on the same file. However, several system calls such as flock( ) are available to allow processes to synchronize themselves on the entire file or on portions of it (see Chapter 12).

To create a new file, the process may also invoke the creat( ) system call, which is handled by the kernel exactly like open( ).
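The relationship between file descriptors and open file objects can be observed from a program. The sketch below opens a file with the O_RDWR and O_CREAT flags and then checks whether two descriptors share one file pointer. It assumes the POSIX dup() call, which is not mentioned in this section, as one way of obtaining a second descriptor for the same open file object; open_or_create() and share_open_file() are hypothetical helpers.

```c
#define _XOPEN_SOURCE 700
#include <fcntl.h>
#include <unistd.h>

/* Opens `path` for reading and writing; if the file does not exist it is
 * created with read/write rights for the owner and read-only rights for
 * the group. Returns the file descriptor, or -1 on failure. */
int open_or_create(const char *path)
{
    return open(path, O_RDWR | O_CREAT, S_IRUSR | S_IWUSR | S_IRGRP);
}

/* Returns 1 if fd1 and fd2 identify the same open file object, 0 if they
 * do not, -1 on error. The test moves the file pointer through fd1 and
 * looks at the position seen through fd2; both positions are restored. */
int share_open_file(int fd1, int fd2)
{
    off_t saved1 = lseek(fd1, 0, SEEK_CUR);
    off_t saved2 = lseek(fd2, 0, SEEK_CUR);
    int shared;

    if (saved1 < 0 || saved2 < 0)
        return -1;
    shared = lseek(fd1, 1, SEEK_SET) == 1 && lseek(fd2, 0, SEEK_CUR) == 1
          && lseek(fd1, 2, SEEK_SET) == 2 && lseek(fd2, 0, SEEK_CUR) == 2;
    lseek(fd1, saved1, SEEK_SET);       /* put both file pointers back */
    lseek(fd2, saved2, SEEK_SET);
    return shared;
}
```

A descriptor obtained with dup() is expected to share the file pointer of the original, whereas a second open( ) of the same pathname yields an independent open file object with its own file pointer.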

1.5.6.2 Accessing an opened file

Regular Unix files can be addressed either sequentially or randomly, while device files and named pipes are usually accessed sequentially (see Chapter 13). In both kinds of access, the kernel stores the file pointer in the open file object, that is, the current position at which the next read or write operation will take place.

Sequential access is implicitly assumed: the read( ) and write( ) system calls always refer to the position of the current file pointer. To modify the value, a program must explicitly invoke the lseek( ) system call. When a file is opened, the kernel sets the file pointer to the position of the first byte in the file (offset 0).

The lseek( ) system call requires the following parameters:

newoffset = lseek(fd, offset, whence);

which have the following meanings:

fd
Indicates the file descriptor of the opened file

offset
Specifies a signed integer value that will be used for computing the new position of the file pointer

whence
Specifies whether the new position should be computed by adding the offset value to the number 0 (offset from the beginning of the file), the current file pointer, or the position of the last byte (offset from the end of the file)

The read( ) system call requires the following parameters:

nread = read(fd, buf, count);

which have the following meanings:

fd
Indicates the file descriptor of the opened file

buf
Specifies the address of the buffer in the process's address space to which the data will be transferred

count
Denotes the number of bytes to read

When handling such a system call, the kernel attempts to read count bytes from the file having the file descriptor fd, starting from the current value of the opened file's offset field. In some cases (end-of-file, empty pipe, and so on) the kernel does not succeed in reading all count bytes. The returned nread value specifies the number of bytes effectively read. The file pointer is also updated by adding nread to its previous value. The write( ) parameters are similar.
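Random access is therefore a matter of calling lseek( ) before read( ). The sketch below combines the two; it assumes the POSIX constant SEEK_SET as the whence value meaning "offset from the beginning of the file" (SEEK_CUR and SEEK_END select the other two cases). Both function names are hypothetical, and demo_random_access() exists only to make the example self-contained.

```c
#define _XOPEN_SOURCE 700
#include <fcntl.h>
#include <unistd.h>

/* Moves the file pointer to `offset` bytes from the beginning of the file
 * and then reads up to `count` bytes. Returns the number of bytes
 * effectively read, which may be less than `count`, or -1 on error. */
ssize_t read_at(int fd, off_t offset, void *buf, size_t count)
{
    if (lseek(fd, offset, SEEK_SET) < 0)
        return -1;
    return read(fd, buf, count);
}

/* Writes the ten bytes "0123456789" to a new file at `path`, reads back
 * up to `count` bytes starting at `offset`, and removes the file again.
 * Returns the number of bytes read, or -1 on error. */
ssize_t demo_random_access(const char *path, off_t offset,
                           char *out, size_t count)
{
    ssize_t n = -1;
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, S_IRUSR | S_IWUSR);

    if (fd < 0)
        return -1;
    if (write(fd, "0123456789", 10) == 10)  /* file pointer is now 10 */
        n = read_at(fd, offset, out, count);
    close(fd);
    unlink(path);
    return n;
}
```

Asking for 4 bytes at offset 8 of this ten-byte file illustrates the short-read case described above: only the bytes that exist before end-of-file are returned.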

1.5.6.3 Closing a file

When a process does not need to access the contents of a file anymore, it can invoke the system call:

res = close(fd);

which releases the open file object corresponding to the file descriptor fd. When a process terminates, the kernel closes all its remaining opened files.

1.5.6.4 Renaming and deleting a file

To rename or delete a file, a process does not need to open it. Indeed, such operations do not act on the contents of the affected file, but rather on the contents of one or more directories. For example, the system call:

res = rename(oldpath, newpath);

changes the name of a file link, while the system call:

res = unlink(pathname);

decrements the file link count and removes the corresponding directory entry. The file is deleted only when the link count assumes the value 0.
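The effect of rename( ) and unlink( ) on the link count can be traced with a short sketch. It assumes the POSIX link() and stat() calls in addition to the two system calls above; link_count() and link_lifecycle() are hypothetical helpers, and the filenames built inside the directory are arbitrary.

```c
#define _XOPEN_SOURCE 700
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Returns the number of hard links of `path`, or -1 if it does not exist. */
long link_count(const char *path)
{
    struct stat st;
    return stat(path, &st) == 0 ? (long)st.st_nlink : -1;
}

/* Walks one file through link(), rename(), and unlink() inside the
 * existing directory `dir`. Returns 0 if every step behaved as the
 * text describes, -1 otherwise. */
int link_lifecycle(const char *dir)
{
    char a[512], b[512], c[512];
    FILE *f;

    snprintf(a, sizeof a, "%s/lifecycle-demo-a", dir);
    snprintf(b, sizeof b, "%s/lifecycle-demo-b", dir);
    snprintf(c, sizeof c, "%s/lifecycle-demo-c", dir);
    unlink(a); unlink(b); unlink(c);        /* clear leftovers, if any */

    f = fopen(a, "w");                      /* new file: one link */
    if (f == NULL)
        return -1;
    fclose(f);
    if (link_count(a) != 1) return -1;
    if (link(a, b) != 0 || link_count(a) != 2) return -1;
    if (rename(b, c) != 0 || link_count(c) != 2) return -1; /* name changes only */
    if (unlink(a) != 0 || link_count(c) != 1) return -1;    /* file survives via c */
    if (unlink(c) != 0 || link_count(c) != -1) return -1;   /* count 0: deleted */
    return 0;
}
```

None of these steps opens the file for reading or writing after it is created; each one only modifies directory entries and the link count kept in the inode.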


1.6 An Overview of Unix Kernels

Unix kernels provide an execution environment in which applications may run. Therefore, the kernel must implement a set of services and corresponding interfaces. Applications use those interfaces and do not usually interact directly with hardware resources.

1.6.1 The Process/Kernel Model

As already mentioned, a CPU can run in either User Mode or Kernel Mode. Actually, some CPUs can have more than two execution states. For instance, the 80x86 microprocessors have four different execution states. But all standard Unix kernels use only Kernel Mode and User Mode.

When a program is executed in User Mode, it cannot directly access the kernel data structures or the kernel programs. When an application executes in Kernel Mode, however, these restrictions no longer apply. Each CPU model provides special instructions to switch from User Mode to Kernel Mode and vice versa. A program usually executes in User Mode and switches to Kernel Mode only when requesting a service provided by the kernel. When the kernel has satisfied the program's request, it puts the program back in User Mode.

Processes are dynamic entities that usually have a limited life span within the system. The task of creating, eliminating, and synchronizing the existing processes is delegated to a group of routines in the kernel.

The kernel itself is not a process but a process manager. The process/kernel model assumes that processes that require a kernel service use specific programming constructs called system calls. Each system call sets up the group of parameters that identifies the process request and then executes the hardware-dependent CPU instruction to switch from User Mode to Kernel Mode.

Besides user processes, Unix systems include a few privileged processes called kernel threads with the following characteristics:

● They run in Kernel Mode in the kernel address space.
● They do not interact with users, and thus do not require terminal devices.
● They are usually created during system startup and remain alive until the system is shut down.

On a uniprocessor system, only one process is running at a time and it may run either in User or in Kernel Mode. If it runs in Kernel Mode, the processor is executing some kernel routine. Figure 1-3 illustrates examples of transitions between User and Kernel Mode. Process 1 in User Mode issues a system call, after which the process switches to Kernel Mode and the system call is serviced. Process 1 then resumes execution in User Mode until a timer interrupt occurs and the scheduler is activated in Kernel Mode. A process switch takes place and Process 2 starts its execution in User Mode until a hardware device raises an interrupt. As a consequence of the interrupt, Process 2 switches to Kernel Mode and services the interrupt.

Figure 1-3. Transitions between User and Kernel Mode

Unix kernels do much more than handle system calls; in fact, kernel routines can be activated in several ways:

● A process invokes a system call.
● The CPU executing the process signals an exception, which is an unusual condition such as an invalid instruction. The kernel handles the exception on behalf of the process that caused it.
● A peripheral device issues an interrupt signal to the CPU to notify it of an event such as a request for attention, a status change, or the completion of an I/O operation. Each interrupt signal is dealt with by a kernel program called an interrupt handler. Since peripheral devices operate asynchronously with respect to the CPU, interrupts occur at unpredictable times.
● A kernel thread is executed. Since it runs in Kernel Mode, the corresponding program must be considered part of the kernel.

1.6.2 Process Implementation

To let the kernel manage processes, each process is represented by a process descriptor that includes information about the current state of the process.

When the kernel stops the execution of a process, it saves the current contents of several processor registers in the process descriptor. These include:

● The program counter (PC) and stack pointer (SP) registers
● The general purpose registers
● The floating point registers
● The processor control registers (Processor Status Word) containing information about the CPU state
● The memory management registers used to keep track of the RAM accessed by the process

When the kernel decides to resume executing a process, it uses the proper process descriptor fields to load the CPU registers. Since the stored value of the program counter points to the instruction following the last instruction executed, the process resumes execution at the point where it was stopped.

When a process is not executing on the CPU, it is waiting for some event. Unix kernels distinguish many wait states, which are usually implemented by queues of process descriptors; each (possibly empty) queue corresponds to the set of processes waiting for a specific event.
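The save-and-resume cycle described above can be pictured with a small user-space model. The following is only a sketch under stated assumptions: `ProcessDescriptor`, `ToyCPU`, and the register names are hypothetical and far simpler than any real kernel structure, and Python is used purely for brevity.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ProcessDescriptor:
    """Hypothetical, heavily simplified process descriptor."""
    pid: int
    state: str = "ready"
    saved_registers: dict = field(default_factory=dict)


class ToyCPU:
    """Stand-in for the hardware registers of a single CPU."""

    def __init__(self):
        self.registers = {"pc": 0, "sp": 0}


def save_context(cpu, proc):
    # Copy the current register contents into the descriptor.
    proc.saved_registers = dict(cpu.registers)


def restore_context(cpu, proc):
    # Reload the registers from the descriptor fields.
    cpu.registers = dict(proc.saved_registers)


def demo():
    cpu = ToyCPU()
    disk_wait_queue = deque()  # one queue per awaited event

    p1 = ProcessDescriptor(pid=1)
    cpu.registers = {"pc": 0x4004, "sp": 0x7FF0}  # p1 is running

    # p1 must wait for the disk: save its context and queue its descriptor.
    save_context(cpu, p1)
    p1.state = "waiting"
    disk_wait_queue.append(p1)

    # Some other process uses the CPU in the meantime.
    cpu.registers = {"pc": 0x9000, "sp": 0x6FF0}

    # The event occurs: dequeue p1 and resume it where it stopped.
    resumed = disk_wait_queue.popleft()
    resumed.state = "running"
    restore_context(cpu, resumed)
    return cpu.registers["pc"], resumed.state, len(disk_wait_queue)


if __name__ == "__main__":
    print(demo())
```

Because the saved program counter is reloaded unchanged, the toy process continues from the address at which it was stopped, which is the property the text relies on.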

1.6.3 Reentrant Kernels

All Unix kernels are reentrant. This means that several processes may be executing in Kernel Mode at the same time. Of course, on uniprocessor systems, only one process can progress, but many can be blocked in Kernel Mode when waiting for the CPU or the completion of some I/O operation. For instance, after issuing a read to a disk on behalf of some process, the kernel lets the disk controller handle it, and resumes executing other processes. An interrupt notifies the kernel when the device has satisfied the read, so the former process can resume the execution.

One way to provide reentrancy is to write functions so that they modify only local variables and do not alter global data structures. Such functions are called reentrant functions. But a reentrant kernel is not limited just to such reentrant functions (although that is how some realtime kernels are implemented). Instead, the kernel can include nonreentrant functions and use locking mechanisms to ensure that only one process can execute a nonreentrant function at a time. Every process in Kernel Mode acts on its own set of memory locations and cannot interfere with the others.

If a hardware interrupt occurs, a reentrant kernel is able to suspend the current running process even if that process is in Kernel Mode. This capability is very important, since it improves the throughput of the device controllers that issue interrupts. Once a device has issued an interrupt, it waits until the CPU acknowledges it. If the kernel is able to answer quickly, the device controller will be able to perform other tasks while the CPU handles the interrupt.

Now let's look at kernel reentrancy and its impact on the organization of the kernel. A kernel control path denotes the sequence of instructions executed by the kernel to handle a system call, an exception, or an interrupt.

In the simplest case, the CPU executes a kernel control path sequentially from the first instruction to the last. When one of the following events occurs, however, the CPU interleaves the kernel control paths:

● A process executing in User Mode invokes a system call, and the corresponding kernel control path verifies that the request cannot be satisfied immediately; it then invokes the scheduler to select a new process to run. As a result, a process switch occurs. The first kernel control path is left unfinished and the CPU resumes the execution of some other kernel control path. In this case, the two control paths are executed on behalf of two different processes.
● The CPU detects an exception (for example, access to a page not present in RAM) while running a kernel control path. The first control path is suspended, and the CPU starts the execution of a suitable procedure. In our example, this type of procedure can allocate a new page for the process and read its contents from disk. When the procedure terminates, the first control path can be resumed. In this case, the two control paths are executed on behalf of the same process.
● A hardware interrupt occurs while the CPU is running a kernel control path with the interrupts enabled. The first kernel control path is left unfinished and the CPU starts processing another kernel control path to handle the interrupt. The first kernel control path resumes when the interrupt handler terminates. In this case, the two kernel control paths run in the execution context of the same process, and the total elapsed system time is accounted to it. However, the interrupt handler doesn't necessarily operate on behalf of the process.

Figure 1-4 illustrates a few examples of noninterleaved and interleaved kernel control paths. Three different CPU states are considered:

● Running a process in User Mode (User)
● Running an exception or a system call handler (Excp)
● Running an interrupt handler (Intr)

Figure 1-4. Interleaving of kernel control paths

1.6.4 Process Address Space

Each process runs in its private address space. A process running in User Mode refers to private stack, data, and code areas. When running in Kernel Mode, the process addresses the kernel data and code area and uses another stack.

Since the kernel is reentrant, several kernel control paths, each related to a different process, may be executed in turn. In this case, each kernel control path refers to its own private kernel stack.

While it appears to each process that it has access to a private address space, there are times when part of the address space is shared among processes. In some cases, this sharing is explicitly requested by processes; in others, it is done automatically by the kernel to reduce memory usage.

If the same program, say an editor, is needed simultaneously by several users, the program is loaded into memory only once, and its instructions can be shared by all of the users who need it. Its data, of course, must not be shared because each user will have separate data. This kind of shared address space is done automatically by the kernel to save memory.

Processes can also share parts of their address space as a kind of interprocess communication, using the "shared memory" technique introduced in System V and supported by Linux.

Finally, Linux supports the mmap( ) system call, which allows part of a file or the memory residing on a device to be mapped into a part of a process address space. Memory mapping can provide an alternative to normal reads and writes for transferring data. If the same file is shared by several processes, its memory mapping is included in the address space of each of the processes that share it.
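To make the idea of memory mapping as an alternative to reads and writes more concrete, here is a minimal user-space sketch that relies on Python's standard `mmap` module, which wraps the mmap( ) system call. The temporary file and its contents are arbitrary choices made for the illustration.

```python
import mmap
import os
import tempfile


def mmap_roundtrip():
    # Create a small temporary file to map (the contents are arbitrary).
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"hello kernel")
        # Map the whole file into this process's address space.
        with mmap.mmap(fd, 0) as region:
            first_word = bytes(region[:5])   # read through memory
            region[:5] = b"HELLO"            # write through memory
            region.flush()                   # push the change to the file
        # A normal read now sees the change made through the mapping.
        os.lseek(fd, 0, os.SEEK_SET)
        return first_word, os.read(fd, 64)
    finally:
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    print(mmap_roundtrip())
```

The file is modified without any write( ) call on its contents: the assignment to the mapped region is an ordinary memory access, and the kernel propagates it to the file.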

1.6.5 Synchronization and Critical Regions

Implementing a reentrant kernel requires the use of synchronization. If a kernel control path is suspended while acting on a kernel data structure, no other kernel control path should be allowed to act on the same data structure unless it has been reset to a consistent state. Otherwise, the interaction of the two control paths could corrupt the stored information.

For example, suppose a global variable V contains the number of available items of some system resource. The first kernel control path, A, reads the variable and determines that there is just one available item. At this point, another kernel control path, B, is activated and reads the same variable, which still contains the value 1. Thus, B decrements V and starts using the resource item. Then A resumes the execution; because it has already read the value of V, it assumes that it can decrement V and take the resource item, which B already uses. As a final result, V contains -1, and two kernel control paths use the same resource item with potentially disastrous effects.

When the outcome of some computation depends on how two or more processes are scheduled, the code is incorrect. We say that there is a race condition.

In general, safe access to a global variable is ensured by using atomic operations. In the previous example, data corruption is not possible if the two control paths read and decrement V with a single, noninterruptible operation. However, kernels contain many data structures that cannot be accessed with a single operation. For example, it usually isn't possible to remove an element from a linked list with a single operation because the kernel needs to access at least two pointers at once. Any section of code that should be finished by each process that begins it before another process can enter it is called a critical region.[10]

[10] Synchronization problems have been fully described in other works; we refer the interested reader to books on the Unix operating systems (see the bibliography).

These problems occur not only among kernel control paths, but also among processes sharing common data. Several synchronization techniques have been adopted. The following section concentrates on how to synchronize kernel control paths.
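The race on the variable V described above can be replayed deterministically. The sketch below is a toy model, not kernel code: each control path is a Python generator that can be suspended between reading V and decrementing it, and the hypothetical `run()` helper plays the role of the scheduler by stepping the paths in a given order.

```python
def take_item(state):
    """One control path: read V, then (possibly much later) decrement it."""
    seen = state["V"]          # read the counter
    yield                      # point at which the path may be suspended
    if seen > 0:               # decision based on a possibly stale value
        state["V"] -= 1
        state["users"] += 1


def run(schedule):
    """Step paths A and B in the given order; return (V, number of users)."""
    state = {"V": 1, "users": 0}
    paths = {"A": take_item(state), "B": take_item(state)}
    for name in schedule:
        next(paths[name], None)   # run the path up to its next suspension
    return state["V"], state["users"]


if __name__ == "__main__":
    print(run(["A", "A", "B", "B"]))   # A completes before B starts
    print(run(["A", "B", "B", "A"]))   # B runs between A's read and update
```

With the first schedule, one path takes the item and V ends at 0. With the second, both paths act on the value 1 they read earlier, so V ends at -1 and two users hold the single item. The result depends only on the scheduling order, which is exactly what makes it a race condition.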

1.6.5.1 Nonpreemptive kernels

In search of a drastically simple solution to synchronization problems, most traditional Unix kernels are nonpreemptive: when a process executes in Kernel Mode, it cannot be arbitrarily suspended and substituted with another process. Therefore, on a uniprocessor system, all kernel data structures that are not updated by interrupts or exception handlers are safe for the kernel to access.

Of course, a process in Kernel Mode can voluntarily relinquish the CPU, but in this case, it must ensure that all data structures are left in a consistent state. Moreover, when it resumes its execution, it must recheck the value of any previously accessed data structures that could be changed.

Nonpreemptability is ineffective in multiprocessor systems, since two kernel control paths running on different CPUs can concurrently access the same data structure.

1.6.5.2 Interrupt disabling

Another synchronization mechanism for uniprocessor systems consists of disabling all hardware interrupts before entering a critical region and reenabling them right after leaving it. This mechanism, while simple, is far from optimal. If the critical region is large, interrupts can remain disabled for a relatively long time, potentially causing all hardware activities to freeze.

Moreover, on a multiprocessor system, this mechanism doesn't work at all. There is no way to ensure that no other CPU can access the same data structures that are updated in the protected critical region.

1.6.5.3 Semaphores

A widely used mechanism, effective in both uniprocessor and multiprocessor systems, relies on the use of semaphores. A semaphore is simply a counter associated with a data structure; it is checked by all kernel threads before they try to access the data structure. Each semaphore may be viewed as an object composed of:

● An integer variable
● A list of waiting processes
● Two atomic methods: down( ) and up( )

The down( ) method decrements the value of the semaphore. If the new value is less than 0, the method adds the running process to the semaphore list and then blocks (i.e., invokes the scheduler). The up( ) method increments the value of the semaphore and, if its new value is greater than or equal to 0, reactivates one or more processes in the semaphore list.

Each data structure to be protected has its own semaphore, which is initialized to 1. When a kernel control path wishes to access the data structure, it executes the down( ) method on the proper semaphore. If the new value of the semaphore isn't negative, access to the data structure is granted. Otherwise, the process that is executing the kernel control path is added to the semaphore list and blocked. When another process executes the up( ) method on that semaphore, one of the processes in the semaphore list is allowed to proceed.
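The semaphore object just described (a counter, a list of waiting processes, and the down( ) and up( ) methods) can be modeled in a few lines. This is a single-threaded sketch under stated assumptions: `ToySemaphore` is a hypothetical name, processes are represented by plain strings, "blocking" is reported through a return value instead of invoking a scheduler, and up( ) simply wakes the oldest waiter whenever the list is not empty.

```python
from collections import deque


class ToySemaphore:
    """Counter plus wait list; real kernels make down() and up() atomic."""

    def __init__(self, value=1):
        self.value = value
        self.waiting = deque()

    def down(self, process):
        """Return True if access is granted, False if the process blocks."""
        self.value -= 1
        if self.value < 0:
            self.waiting.append(process)   # queue the process and "block"
            return False
        return True

    def up(self):
        """Return the process that is reactivated, or None."""
        self.value += 1
        if self.waiting:
            return self.waiting.popleft()
        return None


def demo():
    sem = ToySemaphore(1)      # protects one data structure
    got_p1 = sem.down("p1")    # value 1 -> 0: access granted
    got_p2 = sem.down("p2")    # value 0 -> -1: p2 is queued
    woken = sem.up()           # p1 leaves: value -1 -> 0, p2 may proceed
    return got_p1, got_p2, woken, sem.value


if __name__ == "__main__":
    print(demo())
```

A negative counter in this model always equals minus the number of queued processes, which is why a semaphore initialized to 1 admits exactly one control path at a time.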

1.6.5.4 Spin locks

In multiprocessor systems, semaphores are not always the best solution to the synchronization problems. Some kernel data structures should be protected from being concurrently accessed by kernel control paths that run on different CPUs. In this case, if the time required to update the data structure is short, a semaphore could be very inefficient. To check a semaphore, the kernel must insert a process in the semaphore list and then suspend it. Since both operations are relatively expensive, in the time it takes to complete them, the other kernel control path could have already released the semaphore.

In these cases, multiprocessor operating systems use spin locks. A spin lock is very similar to a semaphore, but it has no process list; when a process finds the lock closed by another process, it "spins" around repeatedly, executing a tight instruction loop until the lock becomes open.

Of course, spin locks are useless in a uniprocessor environment. When a kernel control path tries to access a locked data structure, it starts an endless loop. Therefore, the kernel control path that is updating the protected data structure would not have a chance to continue the execution and release the spin lock. The final result would be that the system hangs.
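The busy-waiting behavior of a spin lock can be imitated in user space. The sketch below is an analogy only: the hypothetical `SpinLock` class borrows a `threading.Lock` purely as an atomic test-and-set flag (a non-blocking acquire either succeeds at once or fails at once), and Python threads are preemptively scheduled, so the spinning thread eventually loses the CPU and the lock holder can finish, unlike the uniprocessor kernel case described above.

```python
import threading


class SpinLock:
    """Busy-waiting lock; no list of waiting processes is kept."""

    def __init__(self):
        # threading.Lock is used only as an atomic test-and-set flag.
        self._flag = threading.Lock()

    def lock(self):
        # Spin: retry the non-blocking acquire until it succeeds.
        while not self._flag.acquire(blocking=False):
            pass

    def unlock(self):
        self._flag.release()


def count_with_spinlock(n_threads=4, n_iterations=1000):
    lock = SpinLock()
    total = 0

    def worker():
        nonlocal total
        for _ in range(n_iterations):
            lock.lock()
            total += 1        # short critical region
            lock.unlock()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total


if __name__ == "__main__":
    print(count_with_spinlock())
```

The critical region is deliberately tiny. As the text points out, spinning pays off only when the lock is held for less time than it would take to queue and suspend the waiting process.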

1.6.5.5 Avoiding deadlocks

Processes or kernel control paths that synchronize with other control paths may easily enter a deadlocked state. The simplest case of deadlock occurs when process p1 gains access to data structure a and process p2 gains access to b, but p1 then waits for b and p2 waits for a. Other more complex cyclic waits among groups of processes may also occur. Of course, a deadlock condition causes a complete freeze of the affected processes or kernel control paths.

As far as kernel design is concerned, deadlocks become an issue when the number of kernel semaphores used is high. In this case, it may be quite difficult to ensure that no deadlock state will ever be reached for all possible ways to interleave kernel control paths. Several operating systems, including Linux, avoid this problem by introducing a very limited number of semaphores and requesting semaphores in an ascending order.
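Requesting locks in ascending order can be illustrated with ordinary user-space threads. In this sketch, `RankedLock` and `acquire_in_order()` are hypothetical helpers: every lock carries a fixed rank, and a thread that needs several locks always takes them by increasing rank, whatever order it names them in. The p1/p2 scenario above then cannot produce a cyclic wait.

```python
import threading


class RankedLock:
    """A lock with a fixed rank; ranks define the global acquisition order."""

    def __init__(self, rank):
        self.rank = rank
        self.lock = threading.Lock()


def acquire_in_order(*locks):
    # Always take locks by ascending rank, regardless of the caller's order.
    ordered = sorted(locks, key=lambda lk: lk.rank)
    for lk in ordered:
        lk.lock.acquire()
    return ordered


def release_all(ordered):
    for lk in reversed(ordered):
        lk.lock.release()


def demo(rounds=200):
    a, b = RankedLock(1), RankedLock(2)
    finished = []

    def worker(first, second, name):
        for _ in range(rounds):
            held = acquire_in_order(first, second)
            release_all(held)
        finished.append(name)

    # p1 names (a, b) and p2 names (b, a), as in the deadlock scenario.
    t1 = threading.Thread(target=worker, args=(a, b, "p1"))
    t2 = threading.Thread(target=worker, args=(b, a, "p2"))
    t1.start()
    t2.start()
    t1.join(timeout=10)
    t2.join(timeout=10)
    return sorted(finished)


if __name__ == "__main__":
    print(demo())
```

Since both threads end up taking a before b, whichever one holds a is never waiting for a lock held by the other, and both always run to completion.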

1.6.6 Signals and Interprocess Communication

Unix signals provide a mechanism for notifying processes of system events. Each event has its own signal number, which is usually referred to by a symbolic constant such as SIGTERM. There are two kinds of system events:

Asynchronous notifications
    For instance, a user can send the interrupt signal SIGINT to a foreground process by pressing the interrupt keycode (usually CTRL-C) at the terminal.

Synchronous errors or exceptions
    For instance, the kernel sends the signal SIGSEGV to a process when it accesses a memory location at an illegal address.

The POSIX standard defines about 20 different signals, two of which are user-definable and may be used as a primitive mechanism for communication and synchronization among processes in User Mode. In general, a process may react to a signal delivery in two possible ways:

● Ignore the signal.
● Asynchronously execute a specified procedure (the signal handler).

If the process does not specify one of these alternatives, the kernel performs a default action that depends on the signal number. The five possible default actions are:

● Terminate the process.
● Write the execution context and the contents of the address space in a file (core dump) and terminate the process.
● Ignore the signal.
● Suspend the process.
● Resume the process's execution, if it was stopped.
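The two ways of reacting to a signal can be tried from User Mode with one of the user-definable signals. The following sketch assumes a Unix system and must run in the main thread of the Python interpreter; it uses the standard `signal` and `os` modules, and the function name `deliver_to_self` is chosen only for the illustration.

```python
import os
import signal
import time


def deliver_to_self():
    received = []

    def handler(signum, frame):
        # Runs asynchronously with respect to the main flow of the program.
        received.append(signum)

    previous = signal.signal(signal.SIGUSR1, handler)   # install a handler
    try:
        os.kill(os.getpid(), signal.SIGUSR1)            # send the signal
        deadline = time.monotonic() + 2.0
        while not received and time.monotonic() < deadline:
            time.sleep(0.01)                            # wait for delivery

        signal.signal(signal.SIGUSR1, signal.SIG_IGN)   # now ignore it
        os.kill(os.getpid(), signal.SIGUSR1)
        time.sleep(0.05)
    finally:
        # Restore the earlier disposition (fall back to the default action).
        signal.signal(signal.SIGUSR1,
                      previous if previous is not None else signal.SIG_DFL)
    return received


if __name__ == "__main__":
    print(deliver_to_self())
```

Only the first SIGUSR1 reaches the handler; the second is discarded because the process asked for it to be ignored. Had neither choice been made, the default action for SIGUSR1, which is to terminate the process, would have applied.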

Kernel signal handling is rather elaborate since the POSIX semantics allows processes to temporarily block signals. Moreover, the SIGKILL and SIGSTOP signals cannot be directly handled by the process or ignored.

AT&T's Unix System V introduced other kinds of interprocess communication among processes in User Mode, which have been adopted by many Unix kernels: semaphores, message queues, and shared memory. They are collectively known as System V IPC.

The kernel implements these constructs as IPC resources. A process acquires a resource by invoking a shmget( ), semget( ), or msgget( ) system call. Just like files, IPC resources are persistent: they must be explicitly deallocated by the creator process, by the current owner, or by a superuser process.

Semaphores are similar to those described in Section 1.6.5, earlier in this chapter, except that they are reserved for processes in User Mode. Message queues allow processes to exchange messages by using the msgsnd( ) and msgrcv( ) system calls, which insert a message into a specific message queue and extract a message from it, respectively.

Shared memory provides the fastest way for processes to exchange and share data. A process starts by issuing a shmget( ) system call to create a new shared memory having a required size. After obtaining the IPC resource identifier, the process invokes the shmat( ) system call, which returns the starting address of the new region within the process address space. When the process wishes to detach the shared memory from its address space, it invokes the shmdt( ) system call. The implementation of shared memory depends on how the kernel implements process address spaces.
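The shmget( ), shmat( ), and shmdt( ) sequence can be exercised from a single process. The sketch below calls the C library wrappers through Python's `ctypes` module, since the Python standard library has no System V IPC module. The numeric constants are the values used by the Linux headers and should be treated as assumptions on other systems. The same region is attached twice so that data written through one address can be read back through the other.

```python
import ctypes

# Values from the Linux headers; treat them as assumptions elsewhere.
IPC_PRIVATE = 0
IPC_CREAT = 0o1000
IPC_RMID = 0

libc = ctypes.CDLL(None, use_errno=True)
libc.shmget.argtypes = [ctypes.c_int, ctypes.c_size_t, ctypes.c_int]
libc.shmget.restype = ctypes.c_int
libc.shmat.argtypes = [ctypes.c_int, ctypes.c_void_p, ctypes.c_int]
libc.shmat.restype = ctypes.c_void_p
libc.shmdt.argtypes = [ctypes.c_void_p]
libc.shmdt.restype = ctypes.c_int
libc.shmctl.argtypes = [ctypes.c_int, ctypes.c_int, ctypes.c_void_p]
libc.shmctl.restype = ctypes.c_int

FAILED = ctypes.c_void_p(-1).value   # shmat() returns (void *) -1 on error


def attach(shmid):
    addr = libc.shmat(shmid, None, 0)   # let the kernel choose the address
    if addr is None or addr == FAILED:
        raise OSError(ctypes.get_errno(), "shmat failed")
    return addr


def shared_memory_demo(size=4096):
    shmid = libc.shmget(IPC_PRIVATE, size, IPC_CREAT | 0o600)
    if shmid == -1:
        raise OSError(ctypes.get_errno(), "shmget failed")
    try:
        # Attach the same region twice; each call returns its own address.
        first, second = attach(shmid), attach(shmid)
        ctypes.memmove(first, b"ping", 4)      # write through one mapping
        seen = ctypes.string_at(second, 4)     # read through the other
        libc.shmdt(first)
        libc.shmdt(second)
        return seen
    finally:
        # IPC resources persist until they are explicitly deallocated.
        libc.shmctl(shmid, IPC_RMID, None)


if __name__ == "__main__":
    print(shared_memory_demo())
```

The final shmctl( ) call with IPC_RMID reflects the persistence mentioned above: detaching the region is not enough to release it, so the creator removes the resource explicitly.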

1.6.7 Process Management

Unix makes a neat distinction between the process and the program it is executing. To that end, the fork( ) and _exit( ) system calls are used respectively to create a new process and to terminate it, while an exec( )-like system call is invoked to load a new program. After such a system call is executed, the process resumes execution with a brand new address space containing the loaded program.

The process that invokes a fork( ) is the parent, while the new process is its child. Parents and children can find one another because the data structure describing each process includes a pointer to its immediate parent and pointers to all its immediate children.

A naive implementation of fork( ) would require both the parent's data and the parent's code to be duplicated and the copies to be assigned to the child. This would be quite time consuming. Current kernels that can rely on hardware paging units follow the Copy-On-Write approach, which defers page duplication until the last moment (i.e., until the parent or the child is required to write into a page). We shall describe how Linux implements this technique in Section 8.4.4.

The _exit( ) system call terminates a process. The kernel handles this system call by releasing the resources owned by the process and sending the parent process a SIGCHLD signal, which is ignored by default.
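The division of labor between fork( ), an exec( )-like call, and _exit( ) can be seen in a short user-space sketch. It assumes a Unix system and uses the wrappers in Python's `os` module; the program loaded by the child is simply another Python interpreter asked to terminate with a chosen status, and the parent collects that status with waitpid( ), one of the wait( ) variants.

```python
import os
import sys


def spawn_and_wait(exit_code=7):
    pid = os.fork()                    # create the child process
    if pid == 0:
        # Child: replace the inherited program with a new one.
        try:
            os.execv(sys.executable,
                     [sys.executable, "-c", f"raise SystemExit({exit_code})"])
        finally:
            os._exit(127)              # reached only if execv() failed
    # Parent: wait for this child and decode its termination status.
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return None


if __name__ == "__main__":
    print(spawn_and_wait())
```

The child keeps the PID it received from fork( ) across the exec, which is why the parent can wait for it even though the program it runs has changed completely.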

1.6.7.1 Zombie processes

How can a parent process inquire about termination of its children? The wait( ) system call allows a process to wait until one of its children terminates; it returns the process ID (PID) of the terminated child.

When executing this system call, the kernel checks whether a child has already terminated. A special zombie process state is introduced to represent terminated processes: a process remains in that state until its parent process executes a wait( ) system call on it. The system call handler extracts data about resource usage from the process descriptor fields; the process descriptor may be released once the data is collected. If no child process has already terminated when the wait( ) system call is executed, the kernel usually puts the process in a wait state until a child terminates.

Many kernels also implement a waitpid( ) system call, which allows a process to wait for a specific child process. Other variants of wait( ) system calls are also quite common.

It's good practice for the kernel to keep around information on a child process until the parent issues its wait( ) call, but suppose the parent process terminates without issuing that call? The information takes up valuable memory slots that could be used to serve living processes. For example, many shells allow the user to start a command in the background and then log out. The process that is running the command shell terminates, but its children continue their execution.

The solution lies in a special system process called init, which is created during system initialization. When a process terminates, the kernel changes the appropriate process descriptor pointers of all the existing children of the terminated process to make them become children of init. This process monitors the execution of all its children and routinely issues wait( ) system calls, whose side effect is to get rid of all zombies.
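The zombie state can be observed directly on a Linux system. This sketch forks a child that terminates at once, polls the state letter that Linux exposes in /proc/<pid>/stat, and only then issues the wait. Reading /proc is a Linux-specific assumption; where it is unavailable, the hypothetical helper `proc_state()` returns None.

```python
import os
import time


def proc_state(pid):
    """Return the state letter from /proc/<pid>/stat, or None if unavailable."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            data = f.read()
    except OSError:
        return None
    # The state letter follows the parenthesized command name.
    return data.rsplit(")", 1)[1].split()[0]


def observe_zombie(timeout=2.0):
    pid = os.fork()
    if pid == 0:
        os._exit(0)                      # child terminates immediately
    deadline = time.monotonic() + timeout
    state = proc_state(pid)
    while state not in ("Z", None) and time.monotonic() < deadline:
        time.sleep(0.01)                 # child has not finished exiting yet
        state = proc_state(pid)
    os.waitpid(pid, 0)                   # reap the child
    return state, proc_state(pid)        # state before and after the wait


if __name__ == "__main__":
    print(observe_zombie())
```

Until the parent calls waitpid( ), the terminated child still has an entry in the process table and is reported in state Z. Once the wait returns, the entry is gone.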

1.6.7.2 Process groups and login sessions Mo d e rn Un ix o p e ra t in g s ys t e m s in t ro d u ce t h e n o t io n o f p ro ce s s g ro u p s t o re p re s e n t a "jo b " a b s t ra ct io n . Fo r e xa m p le , in o rd e r t o e xe cu t e t h e co m m a n d lin e :

$ ls | sort | more

a shell that supports process groups, such as bash, creates a new group for the three processes corresponding to ls, sort, and more. In this way, the shell acts on the three processes as if they were a single entity (the job, to be precise). Each process descriptor includes a process group ID field. Each group of processes may have a group leader, which is the process whose PID coincides with the process group ID. A newly created process is initially inserted into the process group of its parent.

Modern Unix kernels also introduce login sessions. Informally, a login session contains all processes that are descendants of the process that has started a working session on a specific terminal (usually, the first command shell process created for the user). All processes in a process group must be in the same login session. A login session may have several process groups active simultaneously; one of these process groups is always in the foreground, which means that it has access to the terminal. The other active process groups are in the background. When a background process tries to access the terminal, it receives a SIGTTIN or SIGTTOUT signal. In many command shells, the internal commands bg and fg can be used to put a process group in either the background or the foreground.
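The group-leader rule can be observed from user space. The following sketch assumes a POSIX system; the function name is hypothetical. The child first checks that it inherited its parent's process group, then calls setpgid(0, 0) to start a new group, much as a job-control shell might do for the first process of a pipeline.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper. Returns 1 if the child inherited the parent's
 * process group and then became leader of a new group (PGID == PID),
 * 0 if not, and -1 on error. */
int child_becomes_group_leader(void)
{
    pid_t parent_group = getpgrp();
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (getpgrp() != parent_group)
            _exit(2);                 /* group was not inherited */
        if (setpgid(0, 0) != 0)
            _exit(3);                 /* could not create a new group */
        _exit(getpgrp() == getpid() ? 0 : 1);
    }

    int status;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status) == 0;
}
```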

1.6.8 Memory Management

Memory management is by far the most complex activity in a Unix kernel. More than a third of this book is dedicated just to describing how Linux does it. This section illustrates some of the main issues related to memory management.

1.6.8.1 Virtual memory

All recent Unix systems provide a useful abstraction called virtual memory. Virtual memory acts as a logical layer between the application memory requests and the hardware Memory Management Unit (MMU). Virtual memory has many purposes and advantages:

● Several processes can be executed concurrently.
● It is possible to run applications whose memory needs are larger than the available physical memory.
● Processes can execute a program whose code is only partially loaded in memory.
● Each process is allowed to access a subset of the available physical memory.
● Processes can share a single memory image of a library or program.
● Programs can be relocatable, that is, they can be placed anywhere in physical memory.
● Programmers can write machine-independent code, since they do not need to be concerned about physical memory organization.

The main ingredient of a virtual memory subsystem is the notion of virtual address space. The set of memory references that a process can use is different from physical memory addresses. When a process uses a virtual address, [11] the kernel and the MMU cooperate to locate the actual physical location of the requested memory item.

[11] These addresses have different nomenclatures, depending on the computer architecture. As we'll see in Chapter 2, Intel manuals refer to them as "logical addresses."

Today's CPUs include hardware circuits that automatically translate the virtual addresses into physical ones. To that end, the available RAM is partitioned into page frames 4 or 8 KB in length, and a set of Page Tables is introduced to specify how virtual addresses correspond to physical addresses. These circuits make memory allocation simpler, since a request for a block of contiguous virtual addresses can be satisfied by allocating a group of page frames having noncontiguous physical addresses.
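As a rough illustration of the idea, the sketch below models translation with a single flat table and 4 KB page frames. It is a toy model under stated assumptions, not the layout of any real Page Table: the names PAGE_SHIFT, page_number, page_offset, and translate are chosen here for the example, and real hardware uses multilevel tables with protection bits.

```c
#include <stdint.h>

#define PAGE_SHIFT 12u                      /* 4 KB pages: 2^12 bytes */
#define PAGE_SIZE  (1u << PAGE_SHIFT)

/* Upper bits of a virtual address select the page... */
uint32_t page_number(uint32_t vaddr) { return vaddr >> PAGE_SHIFT; }

/* ...and the lower 12 bits are the offset inside that page. */
uint32_t page_offset(uint32_t vaddr) { return vaddr & (PAGE_SIZE - 1u); }

/* Toy page table: frame_of_page[i] is the page frame number that
 * backs virtual page i. The offset is carried over unchanged. */
uint32_t translate(const uint32_t *frame_of_page, uint32_t vaddr)
{
    uint32_t frame = frame_of_page[page_number(vaddr)];
    return (frame << PAGE_SHIFT) | page_offset(vaddr);
}
```

With a table such as `{7, 3, 9, 2}`, the contiguous virtual pages 1 and 2 are backed by the noncontiguous page frames 3 and 9, which is exactly the freedom the paragraph above describes.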

1.6.8.2 Random access memory usage

All Unix operating systems clearly distinguish between two portions of the random access memory (RAM). A few megabytes are dedicated to storing the kernel image (i.e., the kernel code and the kernel static data structures). The remaining portion of RAM is usually handled by the virtual memory system and is used in three possible ways:

● To satisfy kernel requests for buffers, descriptors, and other dynamic kernel data structures
● To satisfy process requests for generic memory areas and for memory mapping of files
● To get better performance from disks and other buffered devices by means of caches

Each request type is valuable. On the other hand, since the available RAM is limited, some balancing among request types must be done, particularly when little available memory is left. Moreover, when some critical threshold of available memory is reached and a page-frame-reclaiming algorithm is invoked to free additional memory, which are the page frames most suitable for reclaiming? As we shall see in Chapter 16, there is no simple answer to this question and very little support from theory. The only available solution lies in developing carefully tuned empirical algorithms.

One major problem that must be solved by the virtual memory system is memory fragmentation. Ideally, a memory request should fail only when the number of free page frames is too small. However, the kernel is often forced to use physically contiguous memory areas, so a memory request could fail even when enough memory is free, because that memory is not available as one contiguous chunk.
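The fragmentation problem is easy to reproduce in a toy model. The sketch below is an illustration only: it represents physical memory as a hypothetical array of flags (1 = free page frame, 0 = in use) and is not how the kernel tracks page frames.

```c
#include <stddef.h>

/* Total number of free page frames in the map. */
size_t count_free(const unsigned char *free_map, size_t n)
{
    size_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += free_map[i] ? 1 : 0;
    return total;
}

/* Returns 1 if `want` physically contiguous free frames exist. */
int can_allocate_contiguous(const unsigned char *free_map, size_t n,
                            size_t want)
{
    size_t run = 0;                 /* length of the current free run */
    for (size_t i = 0; i < n; i++) {
        run = free_map[i] ? run + 1 : 0;
        if (run >= want)
            return 1;
    }
    return want == 0;
}
```

For the map `{1,0,1,0,1,0,1,0}`, count_free( ) reports four free frames, yet a request for two contiguous frames fails.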

1.6.8.3 Kernel Memory Allocator

The Kernel Memory Allocator (KMA) is a subsystem that tries to satisfy the requests for memory areas from all parts of the system. Some of these requests come from other kernel subsystems needing memory for kernel use, and some requests come via system calls from user programs to increase their processes' address spaces. A good KMA should have the following features:

● It must be fast. Actually, this is the most crucial attribute, since it is invoked by all kernel subsystems (including the interrupt handlers).
● It should minimize the amount of wasted memory.
● It should try to reduce the memory fragmentation problem.
● It should be able to cooperate with the other memory management subsystems to borrow and release page frames from them.

Several proposed KMAs, which are based on a variety of different algorithmic techniques, include:

● Resource map allocator
● Power-of-two free lists
● McKusick-Karels allocator
● Buddy system
● Mach's Zone allocator
● Dynix allocator
● Solaris's Slab allocator

As we shall see in Chapter 7, Linux's KMA uses a Slab allocator on top of a buddy system.

1.6.8.4 Process virtual address space handling

The address space of a process contains all the virtual memory addresses that the process is allowed to reference. The kernel usually stores a process virtual address space as a list of memory area descriptors. For example, when a process starts the execution of some program via an exec( )-like system call, the kernel assigns to the process a virtual address space that comprises memory areas for:

● The executable code of the program
● The initialized data of the program
● The uninitialized data of the program
● The initial program stack (i.e., the User Mode stack)
● The executable code and data of needed shared libraries
● The heap (the memory dynamically requested by the program)

All recent Unix operating systems adopt a memory allocation strategy called demand paging. With demand paging, a process can start program execution with none of its pages in physical memory. As it accesses a nonpresent page, the MMU generates an exception; the exception handler finds the affected memory region, allocates a free page, and initializes it with the appropriate data. In a similar fashion, when the process dynamically requires memory by using malloc( ) or the brk( ) system call (which is invoked internally by malloc( )), the kernel just updates the size of the heap memory region of the process. A page frame is assigned to the process only when it generates an exception by trying to refer to its virtual memory addresses.

Virtual address spaces also allow other efficient strategies, such as the Copy-On-Write strategy mentioned earlier. For example, when a new process is created, the kernel just assigns the parent's page frames to the child address space, but marks them read-only. An exception is raised as soon as the parent or the child tries to modify the contents of a page. The exception handler assigns a new page frame to the affected process and initializes it with the contents of the original page.
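Copy-On-Write itself is invisible to user programs, but the guarantee it preserves can be checked: after fork( ), a write in the child must not change what the parent sees. The sketch below assumes a POSIX system; the variable and function names are hypothetical. It demonstrates the semantics only and does not prove which strategy a given kernel uses underneath.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int value = 1;   /* lives in a page shared after fork() */

/* Hypothetical helper: returns the parent's copy of `value` after the
 * child has overwritten its own copy, or -1 on error. */
int parent_value_after_child_write(void)
{
    value = 1;
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* On a Copy-On-Write kernel this store faults, and the child
         * receives a private copy of the page before the write. */
        value = 42;
        _exit(value == 42 ? 0 : 1);
    }

    int status;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status) ||
        WEXITSTATUS(status) != 0)
        return -1;
    return value;       /* the parent's page is untouched */
}
```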

1.6.8.5 Swapping and caching

To extend the size of the virtual address space usable by the processes, the Unix operating system uses swap areas on disk. The virtual memory system regards the contents of a page frame as the basic unit for swapping. Whenever a process refers to a swapped-out page, the MMU raises an exception. The exception handler then allocates a new page frame and initializes the page frame with its old contents saved on disk.

On the other hand, physical memory is also used as cache for hard disks and other block devices. This is because hard drives are very slow: a disk access requires several milliseconds, which is a very long time compared with the RAM access time. Therefore, disks are often the bottleneck in system performance. As a general rule, one of the policies already implemented in the earliest Unix systems is to defer writing to disk as long as possible by loading into RAM a set of disk buffers that correspond to blocks read from disk. The sync( ) system call forces disk synchronization by writing all of the "dirty" buffers (i.e., all the buffers whose contents differ from those of the corresponding disk blocks) to disk. To avoid data loss, all operating systems take care to periodically write dirty buffers back to disk.
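From a program's point of view, deferred writes mean that a successful write( ) only guarantees the data reached the kernel's buffers. The sketch below assumes a POSIX system with a writable /tmp directory; the function name and the file name template are hypothetical. It uses fsync( ), the per-file counterpart of sync( ), so that only one file's dirty buffers are flushed.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: write `msg` to a temporary file and force it to
 * disk. Returns 0 on success, -1 on failure. */
int write_and_flush(const char *msg)
{
    char path[] = "/tmp/dirty-buffer-XXXXXX";
    int fd = mkstemp(path);          /* create and open a unique file */
    if (fd < 0)
        return -1;

    size_t len = strlen(msg);
    /* After write() returns, the data may sit only in dirty buffers. */
    int ok = write(fd, msg, len) == (ssize_t)len
          && fsync(fd) == 0;         /* flush this file's buffers */

    close(fd);
    unlink(path);                    /* clean up the temporary file */
    return ok ? 0 : -1;
}
```

Calling sync( ) instead would request the same for every dirty buffer in the system.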

1.6.9 Device Drivers

The kernel interacts with I/O devices by means of device drivers. Device drivers are included in the kernel and consist of data structures and functions that control one or more devices, such as hard disks, keyboards, mice, monitors, network interfaces, and devices connected to a SCSI bus. Each driver interacts with the remaining part of the kernel (even with other drivers) through a specific interface. This approach has the following advantages:

● Device-specific code can be encapsulated in a specific module.
● Vendors can add new devices without knowing the kernel source code; only the interface specifications must be known.
● The kernel deals with all devices in a uniform way and accesses them through the same interface.
● It is possible to write a device driver as a module that can be dynamically loaded in the kernel without requiring the system to be rebooted. It is also possible to dynamically unload a module that is no longer needed, therefore minimizing the size of the kernel image stored in RAM.

Figure 1-5 illustrates how device drivers interface with the rest of the kernel and with the processes.

Figure 1-5. Device driver interface

Some user programs (P) wish to operate on hardware devices. They make requests to the kernel using the usual file-related system calls and the device files normally found in the /dev directory. Actually, the device files are the user-visible portion of the device driver interface. Each device file refers to a specific device driver, which is invoked by the kernel to perform the requested operation on the hardware component.

At the time Unix was introduced, graphical terminals were uncommon and expensive, so only alphanumeric terminals were handled directly by Unix kernels. When graphical terminals became widespread, ad hoc applications such as the X Window System were introduced that ran as standard processes and accessed the I/O ports of the graphics interface and the RAM video area directly. Some recent Unix kernels, such as Linux 2.4, provide an abstraction for the frame buffer of the graphics card and allow application software to access it without needing to know anything about the I/O ports of the graphics interface (see Section 13.3.1).
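The "usual file-related system calls" are literally the same ones used for regular files. The sketch below assumes a Linux-like system where the /dev/zero device file exists; the function name is hypothetical. The read( ) request is routed by the kernel to the driver behind the device file, which fills the buffer with zero bytes.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: read `n` bytes from /dev/zero into `buf` and
 * return how many of them are zero, or -1 on error. */
int zero_bytes_from_dev_zero(unsigned char *buf, int n)
{
    int fd = open("/dev/zero", O_RDONLY);   /* open the device file */
    if (fd < 0)
        return -1;

    ssize_t got = read(fd, buf, (size_t)n); /* served by the driver */
    close(fd);
    if (got != n)
        return -1;

    int zeros = 0;
    for (int i = 0; i < n; i++)
        zeros += (buf[i] == 0);
    return zeros;
}
```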


Chapter 2. Memory Addressing

This chapter deals with addressing techniques. Luckily, an operating system is not forced to keep track of physical memory all by itself; today's microprocessors include several hardware circuits to make memory management both more efficient and more robust in case of programming errors.

As in the rest of this book, we offer details in this chapter on how 80x86 microprocessors address memory chips and how Linux uses the available addressing circuits. You will find, we hope, that when you learn the implementation details on Linux's most popular platform you will better understand both the general theory of paging and how to research the implementation on other platforms.

This is the first of three chapters related to memory management; Chapter 7 discusses how the kernel allocates main memory to itself, while Chapter 8 considers how linear addresses are assigned to processes.


2.1 Memory Addresses

Programmers casually refer to a memory address as the way to access the contents of a memory cell. But when dealing with 80x86 microprocessors, we have to distinguish three kinds of addresses:

Logical address
    Included in the machine language instructions to specify the address of an operand or of an instruction. This type of address embodies the well-known 80x86 segmented architecture that forces MS-DOS and Windows programmers to divide their programs into segments. Each logical address consists of a segment and an offset (or displacement) that denotes the distance from the start of the segment to the actual address.

Linear address (also known as virtual address)
    A single 32-bit unsigned integer that can be used to address up to 4 GB, that is, up to 4,294,967,296 memory cells. Linear addresses are usually represented in hexadecimal notation; their values range from 0x00000000 to 0xffffffff.

Physical address
    Used to address memory cells in memory chips. They correspond to the electrical signals sent along the address pins of the microprocessor to the memory bus. Physical addresses are represented as 32-bit unsigned integers.

The CPU control unit transforms a logical address into a linear address by means of a hardware circuit called a segmentation unit; subsequently, a second hardware circuit called a paging unit transforms the linear address into a physical address (see Figure 2-1).

Figure 2-1. Logical address translation

In multiprocessor systems, all CPUs share the same memory; this means that RAM chips may be accessed concurrently by independent CPUs. Since read or write operations on a RAM chip must be performed serially, a hardware circuit called a memory arbiter is inserted between the bus and every RAM chip. Its role is to grant access to a CPU if the chip is free and to delay it if the chip is busy servicing a request by another processor. Even uniprocessor systems use memory arbiters, since they include a specialized processor called DMA that operates concurrently with the CPU (see Section 13.1.4). In the case of multiprocessor systems, the structure of the arbiter is more complex since it has more input ports. The dual Pentium, for instance, maintains a two-port arbiter at each chip entrance and requires that the two CPUs exchange synchronization messages before attempting to use the common bus. From the programming point of view, the arbiter is hidden since it is managed by hardware circuits.


2.2 Segmentation in Hardware

Starting with the 80386 model, Intel microprocessors perform address translation in two different ways called real mode and protected mode. These are described in the next sections. Real mode exists mostly to maintain processor compatibility with older models and to allow the operating system to bootstrap (see Appendix A for a short description of real mode).

2.2.1 Segmentation Registers

A logical address consists of two parts: a segment identifier and an offset that specifies the relative address within the segment. The segment identifier is a 16-bit field called the Segment Selector, while the offset is a 32-bit field.

To make it easy to retrieve segment selectors quickly, the processor provides segmentation registers whose only purpose is to hold Segment Selectors; these registers are called cs, ss, ds, es, fs, and gs. Although there are only six of them, a program can reuse the same segmentation register for different purposes by saving its content in memory and then restoring it later.

Three of the six segmentation registers have specific purposes:

cs
    The code segment register, which points to a segment containing program instructions

ss
    The stack segment register, which points to a segment containing the current program stack

ds
    The data segment register, which points to a segment containing static and external data

The remaining three segmentation registers are general purpose and may refer to arbitrary data segments.

The cs register has another important function: it includes a 2-bit field that specifies the Current Privilege Level (CPL) of the CPU. The value 0 denotes the highest privilege level, while the value 3 denotes the lowest one. Linux uses only levels 0 and 3, which are respectively called Kernel Mode and User Mode.
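As a small illustration, the 2-bit CPL field can be extracted from a cs value with a mask. The text does not say where the field sits; the sketch assumes, following the 80x86 selector layout, that it occupies the two least significant bits. The function and constant names are hypothetical.

```c
#include <stdint.h>

enum { KERNEL_MODE_CPL = 0, USER_MODE_CPL = 3 };

/* Assumption: the CPL is held in bits 0-1 of the cs register. */
unsigned cpl_from_cs(uint16_t cs)
{
    return cs & 0x3u;
}

/* Nonzero if the given cs value denotes User Mode (CPL 3). */
int is_user_mode(uint16_t cs)
{
    return cpl_from_cs(cs) == USER_MODE_CPL;
}
```

For example, a cs value of 0x23 yields a CPL of 3 (User Mode), while 0x10 yields 0 (Kernel Mode).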

2.2.2 Segment Descriptors

Each segment is represented by an 8-byte Segment Descriptor (see Figure 2-2) that describes the segment characteristics. Segment Descriptors are stored either in the Global Descriptor Table (GDT) or in the Local Descriptor Table (LDT).

Figure 2-2. Segment Descriptor format

Usually only one GDT is defined, while each process is permitted to have its own LDT if it needs to create additional segments besides those stored in the GDT. The address of the GDT in main memory is contained in the gdtr processor register and the address of the currently used LDT is contained in the ldtr processor register.

Each Segment Descriptor consists of the following fields:

● A 32-bit Base field that contains the linear address of the first byte of the segment.

● A G granularity flag. If it is cleared (equal to 0), the segment size is expressed in bytes; otherwise, it is expressed in multiples of 4096 bytes.

● A 20-bit Limit field that denotes the segment length, in the units selected by the G flag (segments that have a Limit field equal to zero are considered null). When G is set to 0, the size of a non-null segment may vary between 1 byte and 1 MB; otherwise, it may vary between 4 KB and 4 GB.

● An S system flag. If it is cleared, the segment is a system segment that stores kernel data structures; otherwise, it is a normal code or data segment.

● A 4-bit Type field that characterizes the segment type and its access rights. The following list shows Segment Descriptor types that are widely used.

  Code Segment Descriptor
      Indicates that the Segment Descriptor refers to a code segment; it may be included either in the GDT or in the LDT. The descriptor has the S flag set.

  Data Segment Descriptor
      Indicates that the Segment Descriptor refers to a data segment; it may be included either in the GDT or in the LDT. The descriptor has the S flag set. Stack segments are implemented by means of generic data segments.

  Task State Segment Descriptor (TSSD)
      Indicates that the Segment Descriptor refers to a Task State Segment (TSS), that is, a segment used to save the contents of the processor registers (see Section 3.3.2); it can appear only in the GDT. The corresponding Type field has the value 11 or 9, depending on whether the corresponding process is currently executing on a CPU. The S flag of such descriptors is set to 0.

  Local Descriptor Table Descriptor (LDTD)
      Indicates that the Segment Descriptor refers to a segment containing an LDT; it can appear only in the GDT. The corresponding Type field has the value 2. The S flag of such descriptors is set to 0.

  The next section shows how 80x86 processors are able to decide whether a segment descriptor is stored in the GDT or in the LDT of the process.

● A DPL (Descriptor Privilege Level) 2-bit field used to restrict accesses to the segment. It represents the minimal CPU privilege level required for accessing the segment. Therefore, a segment with its DPL set to 0 is accessible only when the CPL is 0 (that is, in Kernel Mode), while a segment with its DPL set to 3 is accessible with every CPL value.

● A Segment-Present flag that is equal to 0 if the segment is currently not stored in main memory. Linux always sets this field to 1, since it never swaps out whole segments to disk.

● An additional flag called D or B depending on whether the segment contains code or data. Its meaning is slightly different in the two cases, but it is basically set (equal to 1) if the addresses used as segment offsets are 32 bits long and it is cleared if they are 16 bits long (see the Intel manual for further details).

● A reserved bit (bit 53) always set to 0.

● An AVL flag that may be used by the operating system but is ignored in Linux.
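The fields listed above can be unpacked from the 8-byte descriptor with shifts and masks. The sketch below is an illustration under stated assumptions: the bit positions are not given in the text, so they are taken from the standard 80x86 descriptor layout and should be checked against the Intel manual, and the structure and function names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical unpacked view of an 8-byte Segment Descriptor. */
struct seg_desc {
    uint32_t base;      /* 32-bit Base  (bits 16-39 and 56-63) */
    uint32_t limit;     /* 20-bit Limit (bits 0-15 and 48-51)  */
    unsigned type;      /* 4-bit Type   (bits 40-43)           */
    unsigned s;         /* S system flag         (bit 44)      */
    unsigned dpl;       /* DPL                   (bits 45-46)  */
    unsigned present;   /* Segment-Present flag  (bit 47)      */
    unsigned avl;       /* AVL flag              (bit 52)      */
    unsigned db;        /* D or B flag           (bit 54)      */
    unsigned g;         /* G granularity flag    (bit 55)      */
};

struct seg_desc decode_descriptor(uint64_t raw)
{
    struct seg_desc d;
    /* Base and Limit are each split into two pieces. */
    d.limit   = (uint32_t)(raw & 0xffff)
              | ((uint32_t)((raw >> 48) & 0xf) << 16);
    d.base    = (uint32_t)((raw >> 16) & 0xffffff)
              | ((uint32_t)((raw >> 56) & 0xff) << 24);
    d.type    = (unsigned)((raw >> 40) & 0xf);
    d.s       = (unsigned)((raw >> 44) & 0x1);
    d.dpl     = (unsigned)((raw >> 45) & 0x3);
    d.present = (unsigned)((raw >> 47) & 0x1);
    d.avl     = (unsigned)((raw >> 52) & 0x1);
    d.db      = (unsigned)((raw >> 54) & 0x1);
    d.g       = (unsigned)((raw >> 55) & 0x1);
    return d;
}
```

As an illustrative input, the value 0x00cf9a000000ffff decodes to a code segment with Base 0, Limit 0xfffff, G set, S set, DPL 0, and the Segment-Present flag set, that is, a segment covering the whole 4 GB linear address space.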

2.2.3 Fast Access to Segment Descriptors

We recall that logical addresses consist of a 16-bit Segment Selector and a 32-bit Offset, and that segmentation registers store only the Segment Selector.

To speed up the translation of logical addresses into linear addresses, the 80x86 processor provides an additional nonprogrammable register (that is, a register that cannot be set by a programmer) for each of the six programmable segmentation registers. Each nonprogrammable register contains the 8-byte Segment Descriptor (described in the previous section) specified by the Segment Selector contained in the corresponding segmentation register. Every time a Segment Selector is loaded in a segmentation register, the corresponding Segment Descriptor is loaded from memory into the matching nonprogrammable CPU register. From then on, translations of logical addresses referring to that segment can be performed without accessing the GDT or LDT stored in main memory; the processor can just refer directly to the CPU register containing the Segment Descriptor. Accesses to the GDT or LDT are necessary only when the contents of the segmentation register change (see Figure 2-3).

Each Segment Selector includes the following fields:





● A 13-bit index (described further in the text following this list) that identifies the corresponding Segment Descriptor entry contained in the GDT or in the LDT

● A TI (Table Indicator) flag that specifies whether the Segment Descriptor is included in the GDT (TI = 0) or in the LDT (TI = 1)

● An RPL (Requestor Privilege Level) 2-bit field, which is precisely the Current Privilege Level of the CPU when the corresponding Segment Selector is loaded into the cs register [1]

[1] The RPL field may also be used to selectively weaken the processor privilege level when accessing data segments; see Intel documentation for details.

Figure 2-3. Segment Selector and Segment Descriptor

Since a Segment Descriptor is 8 bytes long, its relative address inside the GDT or the LDT is obtained by multiplying the 13-bit index field of the Segment Selector by 8. For instance, if the GDT is at 0x00020000 (the value stored in the gdtr register) and the index specified by the Segment Selector is 2, the address of the corresponding Segment Descriptor is 0x00020000 + (2 × 8), or 0x00020010.
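The selector fields and the address computation can be sketched in C as follows. The field positions (RPL in bits 0-1, TI in bit 2, index in bits 3-15) follow the standard 80x86 selector layout rather than anything stated in the text, and the structure and function names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical unpacked view of a 16-bit Segment Selector. */
struct selector {
    unsigned index;     /* 13-bit index into the GDT or LDT */
    unsigned ti;        /* Table Indicator: 0 = GDT, 1 = LDT */
    unsigned rpl;       /* Requestor Privilege Level */
};

struct selector decode_selector(uint16_t sel)
{
    struct selector s;
    s.rpl   = sel & 0x3u;          /* bits 0-1  */
    s.ti    = (sel >> 2) & 0x1u;   /* bit 2     */
    s.index = (unsigned)sel >> 3;  /* bits 3-15 */
    return s;
}

/* Address of the descriptor named by `sel` inside a table that starts
 * at `table_base`: each descriptor is 8 bytes long. */
uint32_t descriptor_address(uint32_t table_base, uint16_t sel)
{
    return table_base + decode_selector(sel).index * 8u;
}
```

A selector whose index is 2, with TI = 0 and RPL = 0, has the value 0x0010; descriptor_address(0x00020000, 0x0010) reproduces the 0x00020010 result worked out above.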

The first entry of the GDT is always set to 0. This ensures that logical addresses with a null Segment Selector will be considered invalid, thus causing a processor exception. The maximum number of Segment Descriptors that can be stored in the GDT is 8,191 (i.e., $2^{13} - 1$).

2.2.4 Segmentation Unit

Figure 2-4 shows in detail how a logical address is translated into a corresponding linear address. The segmentation unit performs the following operations:

● Examines the TI field of the Segment Selector to determine which Descriptor Table stores the Segment Descriptor. This field indicates that the Descriptor is either in the GDT (in which case the segmentation unit gets the base linear address of the GDT from the gdtr register) or in the active LDT (in which case the segmentation unit gets the base linear address of that LDT from the ldtr register).

● Computes the address of the Segment Descriptor from the index field of the Segment Selector. The index field is multiplied by 8 (the size of a Segment Descriptor), and the result is added to the content of the gdtr or ldtr register.

● Adds the offset of the logical address to the Base field of the Segment Descriptor, thus obtaining the linear address.

Figure 2-4. Translating a logical address

Notice that, thanks to the nonprogrammable registers associated with the segmentation registers, the first two operations need to be performed only when a segmentation register has been changed.
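The three operations of the segmentation unit can be modeled as follows. This is a simplified sketch under stated assumptions: the structure and function names are hypothetical, only the Base field of a descriptor is modeled, the two tables are plain C arrays standing in for the memory located through gdtr and ldtr (so the "multiply by 8" step becomes array indexing), and no limit or privilege checks are performed.

```c
#include <stdint.h>

/* Only the Base field of a Segment Descriptor is modeled here. */
struct seg_descriptor {
    uint32_t base;
};

/* Stand-ins for the tables located through gdtr and ldtr. */
struct seg_unit {
    const struct seg_descriptor *gdt;
    const struct seg_descriptor *ldt;
};

static uint32_t logical_to_linear(const struct seg_unit *u,
                                  uint16_t selector, uint32_t offset)
{
    /* Step 1: the TI flag (bit 2) chooses between the GDT and the LDT. */
    const struct seg_descriptor *table = (selector & 0x4u) ? u->ldt : u->gdt;

    /* Step 2: the 13-bit index selects the descriptor within the table. */
    const struct seg_descriptor *desc = &table[selector >> 3];

    /* Step 3: linear address = Base + offset. */
    return desc->base + offset;
}
```

A real processor caches the descriptor in the nonprogrammable register when the segmentation register is loaded, so steps 1 and 2 are not repeated on every memory access.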


2.3 Segmentation in Linux

Segmentation has been included in 80x86 microprocessors to encourage programmers to split their applications into logically related entities, such as subroutines or global and local data areas. However, Linux uses segmentation in a very limited way. In fact, segmentation and paging are somewhat redundant since both can be used to separate the physical address spaces of processes: segmentation can assign a different linear address space to each process, while paging can map the same linear address space into different physical address spaces. Linux prefers paging to segmentation for the following reasons:

● Memory management is simpler when all processes use the same segment register values; that is, when they share the same set of linear addresses.

● One of the design objectives of Linux is portability to a wide range of architectures; RISC architectures in particular have limited support for segmentation.

The 2.4 version of Linux uses segmentation only when required by the 80x86 architecture. In particular, all processes use the same logical addresses, so the total number of segments to be defined is quite limited, and it is possible to store all Segment Descriptors in the Global Descriptor Table (GDT). This table is implemented by the array gdt_table referred to by the gdt variable.

Local Descriptor Tables are not used by the kernel, although a system call called modify_ldt( ) exists that allows processes to create their own LDTs. This turns out to be useful to applications (such as Wine) that execute segment-oriented Microsoft Windows applications.

Here are the segments used by Linux:

● A kernel code segment. The fields of the corresponding Segment Descriptor in the GDT have the following values:

    ❍ Base = 0x00000000
    ❍ Limit = 0xfffff
    ❍ G (granularity flag) = 1, for segment size expressed in pages
    ❍ S (system flag) = 1, for normal code or data segment
    ❍ Type = 0xa, for code segment that can be read and executed
    ❍ DPL (Descriptor Privilege Level) = 0, for Kernel Mode
    ❍ D/B (32-bit address flag) = 1, for 32-bit offset addresses

  Thus, the linear addresses associated with that segment start at 0 and reach the addressing limit of 2^32 - 1. The S and Type fields specify that the segment is a code segment that can be read and executed. Its DPL value is 0, so it can be accessed only in Kernel Mode. The corresponding Segment Selector is defined by the __KERNEL_CS macro. To address the segment, the kernel just loads the value yielded by the macro into the cs register.

● A kernel data segment. The fields of the corresponding Segment Descriptor in the GDT have the following values:

    ❍ Base = 0x00000000
    ❍ Limit = 0xfffff
    ❍ G (granularity flag) = 1, for segment size expressed in pages
    ❍ S (system flag) = 1, for normal code or data segment
    ❍ Type = 2, for data segment that can be read and written
    ❍ DPL (Descriptor Privilege Level) = 0, for Kernel Mode
    ❍ D/B (32-bit address flag) = 1, for 32-bit offset addresses

  This segment is identical to the previous one (in fact, they overlap in the linear address space), except for the value of the Type field, which specifies that it is a data segment that can be read and written. The corresponding Segment Selector is defined by the __KERNEL_DS macro.



● A user code segment shared by all processes in User Mode. The fields of the corresponding Segment Descriptor in the GDT have the following values:

    ❍ Base = 0x00000000
    ❍ Limit = 0xfffff
    ❍ G (granularity flag) = 1, for segment size expressed in pages
    ❍ S (system flag) = 1, for normal code or data segment
    ❍ Type = 0xa, for code segment that can be read and executed
    ❍ DPL (Descriptor Privilege Level) = 3, for User Mode
    ❍ D/B (32-bit address flag) = 1, for 32-bit offset addresses

  The S and DPL fields specify that the segment is not a system segment and its privilege level is equal to 3; it can thus be accessed both in Kernel Mode and in User Mode. The corresponding Segment Selector is defined by the __USER_CS macro.



● A user data segment shared by all processes in User Mode. The fields of the corresponding Segment Descriptor in the GDT have the following values:

    ❍ Base = 0x00000000
    ❍ Limit = 0xfffff
    ❍ G (granularity flag) = 1, for segment size expressed in pages
    ❍ S (system flag) = 1, for normal code or data segment
    ❍ Type = 2, for data segment that can be read and written
    ❍ DPL (Descriptor Privilege Level) = 3, for User Mode
    ❍ D/B (32-bit address flag) = 1, for 32-bit offset addresses

  This segment overlaps the previous one: they are identical, except for the value of Type. The corresponding Segment Selector is defined by the __USER_DS macro.



● A Task State Segment (TSS) for each processor. The linear address space corresponding to each TSS is a small subset of the linear address space corresponding to the kernel data segment. All the Task State Segments are sequentially stored in the init_tss array; in particular, the Base field of the TSS descriptor for the nth CPU points to the nth component of the init_tss array. The G (granularity) flag is cleared, while the Limit field is set to 0xeb, since the TSS segment is 236 bytes long. The Type field is set to 9 or 11 (available 32-bit TSS), and the DPL is set to 0, since processes in User Mode are not allowed to access TSS segments. You will find details on how Linux uses TSSs in Section 3.3.2.

● A default Local Descriptor Table (LDT) that is usually shared by all processes. This segment is stored in the default_ldt variable. The default LDT includes a single entry consisting of a null Segment Descriptor. Each processor has its own LDT Segment Descriptor, which usually points to the common default LDT segment; its Base field is set to the address of default_ldt and its Limit field is set to 7. When a process requiring a nonempty LDT is running, the LDT descriptor in the GDT corresponding to the executing CPU is replaced by the descriptor associated with the LDT that was built by the process. You will find more details of this mechanism in Chapter 3.

● Four segments related to the Advanced Power Management (APM) support. APM consists of a set of BIOS routines devoted to the management of the power states of the system. If the kernel supports APM, four entries in the GDT store the descriptors of two data segments and two code segments containing APM-related kernel functions.

Figure 2-5. The Global Descriptor Table

In conclusion, as shown in Figure 2-5, the GDT includes a set of common descriptors plus a pair of segment descriptors for each existing CPU: one for the TSS segment and one for the LDT segment. For efficiency, some entries in the GDT table are left unused, so that segment descriptors usually accessed together are kept in the same 32-byte line of the hardware cache (see Section 2.4.7 later in this chapter).
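To make the field values of the four flat code and data segments listed earlier more concrete, the following sketch packs them into the 8-byte Segment Descriptor format of the 80x86 architecture. The function name pack_descriptor is hypothetical, the bit positions are those documented by Intel, and the Segment-Present flag (bit 47), which the list does not mention, is assumed to be set.

```c
#include <stdint.h>

/* Packs the fields discussed in the text into a 64-bit descriptor.
 * Assumption: the Segment-Present flag (bit 47) is always set. */
static uint64_t pack_descriptor(uint32_t base, uint32_t limit, unsigned type,
                                unsigned s, unsigned dpl, unsigned db,
                                unsigned g)
{
    uint64_t d = 0;

    d |= (uint64_t)(limit & 0xffffu);             /* Limit 15-0  -> bits 0-15  */
    d |= (uint64_t)(base & 0xffffffu) << 16;      /* Base 23-0   -> bits 16-39 */
    d |= (uint64_t)(type & 0xfu) << 40;           /* Type        -> bits 40-43 */
    d |= (uint64_t)(s & 1u) << 44;                /* S flag      -> bit 44     */
    d |= (uint64_t)(dpl & 3u) << 45;              /* DPL         -> bits 45-46 */
    d |= (uint64_t)1 << 47;                       /* Present     -> bit 47     */
    d |= (uint64_t)((limit >> 16) & 0xfu) << 48;  /* Limit 19-16 -> bits 48-51 */
    d |= (uint64_t)(db & 1u) << 54;               /* D/B flag    -> bit 54     */
    d |= (uint64_t)(g & 1u) << 55;                /* G flag      -> bit 55     */
    d |= (uint64_t)(base >> 24) << 56;            /* Base 31-24  -> bits 56-63 */
    return d;
}
```

For example, pack_descriptor(0, 0xfffff, 0xa, 1, 0, 1, 1) describes the kernel code segment and pack_descriptor(0, 0xfffff, 2, 1, 3, 1, 1) the user data segment; the four descriptors differ only in the Type and DPL bits.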

As stated earlier, the Current Privilege Level of the CPU indicates whether the processor is in User or Kernel Mode and is specified by the RPL field of the Segment Selector stored in the cs register. Whenever the CPL is changed, some segmentation registers must be correspondingly updated. For instance, when the CPL is equal to 3 (User Mode), the ds register must contain the Segment Selector of the user data segment, but when the CPL is equal to 0, the ds register must contain the Segment Selector of the kernel data segment.

A similar situation occurs for the ss register. It must refer to a User Mode stack inside the user data segment when the CPL is 3, and it must refer to a Kernel Mode stack inside the kernel data segment when the CPL is 0. When switching from User Mode to Kernel Mode, Linux always makes sure that the ss register contains the Segment Selector of the kernel data segment.
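The relationship between the cs register, the CPL, and the data segment selector can be illustrated with a short sketch. The selector constants below are assumptions made for the example (they are believed to match the values of the corresponding macros in the 2.4 i386 headers, but the text does not state them), and the function names are hypothetical.

```c
#include <stdint.h>

/* Assumed selector values, each built as (index << 3) | (TI << 2) | RPL.
 * They are illustrative; check the kernel headers for the real ones. */
enum {
    KERNEL_CS = 0x10,
    KERNEL_DS = 0x18,
    USER_CS   = 0x23,
    USER_DS   = 0x2b
};

/* The CPL is the RPL field (bits 1-0) of the selector held in cs. */
static unsigned current_privilege_level(uint16_t cs)
{
    return cs & 3u;
}

/* Data selector that ds must hold for the given code selector. */
static uint16_t expected_data_selector(uint16_t cs)
{
    return current_privilege_level(cs) == 3 ? USER_DS : KERNEL_DS;
}
```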


2.4 Paging in Hardware

The paging unit translates linear addresses into physical ones. It checks the requested access type against the access rights of the linear address. If the memory access is not valid, it generates a Page Fault exception (see Chapter 4 and Chapter 7).

For the sake of efficiency, linear addresses are grouped in fixed-length intervals called pages; contiguous linear addresses within a page are mapped into contiguous physical addresses. In this way, the kernel can specify the physical address and the access rights of a page instead of those of all the linear addresses included in it. Following the usual convention, we shall use the term "page" to refer both to a set of linear addresses and to the data contained in this group of addresses.

The paging unit thinks of all RAM as partitioned into fixed-length page frames (sometimes referred to as physical pages). Each page frame contains a page; that is, the length of a page frame coincides with that of a page. A page frame is a constituent of main memory, and hence it is a storage area. It is important to distinguish a page from a page frame; the former is just a block of data, which may be stored in any page frame or on disk.

The data structures that map linear to physical addresses are called Page Tables; they are stored in main memory and must be properly initialized by the kernel before enabling the paging unit.

In 80x86 processors, paging is enabled by setting the PG flag of a control register named cr0. When PG = 0, linear addresses are interpreted as physical addresses.

2.4.1 Regular Paging

Starting with the 80386, the paging unit of Intel processors handles 4 KB pages. The 32 bits of a linear address are divided into three fields:

Directory
    The most significant 10 bits

Table
    The intermediate 10 bits

Offset
    The least significant 12 bits

The translation of linear addresses is accomplished in two steps, each based on a type of translation table. The first translation table is called the Page Directory and the second is called the Page Table.

The aim of this two-level scheme is to reduce the amount of RAM required for per-process Page Tables. If a simple one-level Page Table were used, then it would require up to 2^20 entries (i.e., at 4 bytes per entry, 4 MB of RAM) to represent the Page Table for each process (if the process used a full 4 GB linear address space), even though a process does not use all addresses in that range. The two-level scheme reduces the memory by requiring Page Tables only for those virtual memory regions actually used by a process.

Each active process must have a Page Directory assigned to it. However, there is no need to allocate RAM for all Page Tables of a process at once; it is more efficient to allocate RAM for a Page Table only when the process effectively needs it.

The physical address of the Page Directory in use is stored in a control register named cr3. The Directory field within the linear address determines the entry in the Page Directory that points to the proper Page Table. The address's Table field, in turn, determines the entry in the Page Table that contains the physical address of the page frame containing the page. The Offset field determines the relative position within the page frame (see Figure 2-6). Since it is 12 bits long, each page consists of 4,096 bytes of data.

Figure 2-6. Paging by 80x86 processors
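The field split and the one-level table estimate can be expressed directly in C. The helper names below are hypothetical and serve only to illustrate the 10/10/12 division of a 32-bit linear address described in this section.

```c
#include <stdint.h>

/* Regular paging: 10-bit Directory, 10-bit Table, 12-bit Offset. */
static unsigned dir_field(uint32_t linear)    { return linear >> 22; }
static unsigned table_field(uint32_t linear)  { return (linear >> 12) & 0x3ffu; }
static unsigned offset_field(uint32_t linear) { return linear & 0xfffu; }

/* RAM needed by a single-level table that maps a full 4 GB address space:
 * 2^20 entries of 4 bytes each, that is, 4 MB. */
static uint32_t one_level_table_bytes(void)
{
    return (1u << 20) * 4u;
}
```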

Both the Directory and the Table fields are 10 bits long, so Page Directories and Page Tables can include up to 1,024 entries. It follows that a Page Directory can address up to 1024 x 1024 x 4096 = 2^32 memory cells, as you'd expect in 32-bit addresses.

The entries of Page Directories and Page Tables have the same structure. Each entry includes the following fields:

Present flag
    If it is set, the referred-to page (or Page Table) is contained in main memory; if the flag is 0, the page is not contained in main memory and the remaining entry bits may be used by the operating system for its own purposes. If the entry of a Page Table or Page Directory needed to perform an address translation has the Present flag cleared, the paging unit stores the linear address in a control register named cr2 and generates exception 14: the Page Fault exception. (We shall see in Chapter 16 how Linux uses this field.)

Field containing the 20 most significant bits of a page frame physical address
    Since each page frame has a 4-KB capacity, its physical address must be a multiple of 4,096, so the 12 least significant bits of the physical address are always equal to 0. If the field refers to a Page Directory, the page frame contains a Page Table; if it refers to a Page Table, the page frame contains a page of data.

Accessed flag
    Set each time the paging unit addresses the corresponding page frame. This flag may be used by the operating system when selecting pages to be swapped out. The paging unit never resets this flag; this must be done by the operating system.

Dirty flag
    Applies only to the Page Table entries. It is set each time a write operation is performed on the page frame. As for the Accessed flag, Dirty may be used by the operating system when selecting pages to be swapped out. The paging unit never resets this flag; this must be done by the operating system.

Read/Write flag
    Contains the access right (Read/Write or Read) of the page or of the Page Table (see Section 2.4.3 later in this chapter).

User/Supervisor flag
    Contains the privilege level required to access the page or Page Table (see Section 2.4.3 later in this chapter).

PCD and PWT flags
    Control the way the page or Page Table is handled by the hardware cache (see Section 2.4.7 later in this chapter).

Page Size flag
    Applies only to Page Directory entries. If it is set, the entry refers to a 2 MB- or 4 MB-long page frame (see the following sections).

Global flag
    Applies only to Page Table entries. This flag was introduced in the Pentium Pro to prevent frequently used pages from being flushed from the TLB cache (see Section 2.4.8 later in this chapter). It works only if the Page Global Enable (PGE) flag of register cr4 is set.
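The fields just listed occupy fixed bit positions in a 32-bit entry. The sketch below names them with hypothetical macros (the kernel uses its own names); the bit positions are those defined by the 80x86 architecture for 4 KB pages, with the 20-bit page frame field in bits 31-12.

```c
#include <stdint.h>

/* Flag bits of a 32-bit Page Directory or Page Table entry.
 * The macro names are illustrative, not the kernel's own. */
#define ENTRY_PRESENT   0x001u  /* bit 0: Present                      */
#define ENTRY_RW        0x002u  /* bit 1: Read/Write                   */
#define ENTRY_USER      0x004u  /* bit 2: User/Supervisor              */
#define ENTRY_PWT       0x008u  /* bit 3: PWT                          */
#define ENTRY_PCD       0x010u  /* bit 4: PCD                          */
#define ENTRY_ACCESSED  0x020u  /* bit 5: Accessed                     */
#define ENTRY_DIRTY     0x040u  /* bit 6: Dirty (Page Table entries)   */
#define ENTRY_PAGE_SIZE 0x080u  /* bit 7: Page Size (Page Directory)   */
#define ENTRY_GLOBAL    0x100u  /* bit 8: Global (Page Table entries)  */

/* Physical address of the referenced page frame: the 20-bit field in
 * bits 31-12, with the 12 least significant bits equal to 0. */
static uint32_t entry_frame_address(uint32_t entry)
{
    return entry & 0xfffff000u;
}

static int entry_has(uint32_t entry, uint32_t flag)
{
    return (entry & flag) != 0;
}
```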

2.4.2 Extended Paging

Starting with the Pentium model, 80x86 microprocessors introduce extended paging, which allows page frames to be 4 MB instead of 4 KB in size (see Figure 2-7).

Figure 2-7. Extended paging

As mentioned in the previous section, extended paging is enabled by setting the Page Size flag of a Page Directory entry. In this case, the paging unit divides the 32 bits of a linear address into two fields:

Directory
    The most significant 10 bits

Offset
    The remaining 22 bits

Page Directory entries for extended paging are the same as for normal paging, except that:

● The Page Size flag must be set.

● Only the 10 most significant bits of the 20-bit physical address field are significant. This is because each physical address is aligned on a 4-MB boundary, so the 22 least significant bits of the address are 0.

Extended paging coexists with regular paging; it is enabled by setting the PSE flag of the cr4 processor register. Extended paging is used to translate large contiguous linear address ranges into corresponding physical ones; in these cases, the kernel can do without intermediate Page Tables and thus save memory and preserve TLB entries (see Section 2.4.8).
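A translation through an extended paging entry needs only the Page Directory entry and the 22-bit Offset. The following sketch assumes a hypothetical function name and return convention; the Page Size flag is bit 7 of the entry, and the Present flag and access rights are deliberately not checked.

```c
#include <stdint.h>

#define PDE_PAGE_SIZE 0x080u  /* bit 7 of a Page Directory entry */

/* Translates `linear` through a Page Directory entry that maps a 4 MB
 * page frame. Returns 0 on success, or -1 if the entry is a regular
 * paging entry (Page Size flag cleared). */
static int extended_translate(uint32_t pde, uint32_t linear, uint32_t *phys)
{
    if (!(pde & PDE_PAGE_SIZE))
        return -1;

    /* The 10 most significant bits come from the entry; the remaining
     * 22 bits are the Offset field of the linear address. */
    *phys = (pde & 0xffc00000u) | (linear & 0x003fffffu);
    return 0;
}
```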

2.4.3 Hardware Protection Scheme

The paging unit uses a different protection scheme from the segmentation unit. While 80x86 processors allow four possible privilege levels to a segment, only two privilege levels are associated with pages and Page Tables, because privileges are controlled by the User/Supervisor flag mentioned in Section 2.4.1 earlier in this chapter. When this flag is 0, the page can be addressed only when the CPL is less than 3 (this means, for Linux, when the processor is in Kernel Mode). When the flag is 1, the page can always be addressed.

Furthermore, instead of the three types of access rights (Read, Write, and Execute) associated with segments, only two types of access rights (Read and Write) are associated with pages. If the Read/Write flag of a Page Directory or Page Table entry is equal to 0, the corresponding Page Table or page can only be read; otherwise it can be read and written.
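The two checks described above can be summarized in a small predicate. This is a deliberately simplified model with hypothetical names: it looks at a single entry, applies the rules exactly as stated in this section, and ignores details that the text does not cover, such as how the flags of the Page Directory entry and the Page Table entry combine.

```c
#include <stdint.h>

#define ENTRY_RW   0x002u  /* Read/Write flag (bit 1)      */
#define ENTRY_USER 0x004u  /* User/Supervisor flag (bit 2) */

/* Returns 1 if the access is allowed by the entry, 0 otherwise. */
static int page_access_allowed(uint32_t entry, unsigned cpl, int is_write)
{
    /* User/Supervisor = 0: addressable only when the CPL is less than 3. */
    if (!(entry & ENTRY_USER) && cpl == 3)
        return 0;

    /* Read/Write = 0: the page can only be read. */
    if (is_write && !(entry & ENTRY_RW))
        return 0;

    return 1;
}
```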

2.4.4 An Example of Regular Paging

A simple example will help in clarifying how regular paging works. Let's assume that the kernel assigns the linear address space between 0x20000000 and 0x2003ffff to a running process. [2] This space consists of exactly 64 pages. We don't care about the physical addresses of the page frames containing the pages; in fact, some of them might not even be in main memory. We are interested only in the remaining fields of the Page Table entries.

[2] As we shall see in the following chapters, the 3 GB linear address space is an upper limit, but a User Mode process is allowed to reference only a subset of it.

Let's start with the 10 most significant bits of the linear addresses assigned to the process, which are interpreted as the Directory field by the paging unit. The addresses start with a 2 followed by zeros, so the 10 bits all have the same value, namely 0x080 or 128 decimal. Thus the Directory field in all the addresses refers to the 129th entry of the process Page Directory. The corresponding entry must contain the physical address of the Page Table assigned to the process (see Figure 2-8). If no other linear addresses are assigned to the process, all the remaining 1,023 entries of the Page Directory are filled with zeros.

Figure 2-8. An example of paging

The values assumed by the intermediate 10 bits (that is, the values of the Table field) range from 0 to 0x03f, or from 0 to 63 decimal. Thus, only the first 64 entries of the Page Table are significant. The remaining 960 entries are filled with zeros.

Suppose that the process needs to read the byte at linear address 0x20021406. This address is handled by the paging unit as follows:

1. The Directory field 0x80 is used to select entry 0x80 of the Page Directory, which points to the Page Table associated with the process's pages.

2. The Table field 0x21 is used to select entry 0x21 of the Page Table, which points to the page frame containing the desired page.

3. Finally, the Offset field 0x406 is used to select the byte at offset 0x406 in the desired page frame.

If the Present flag of the 0x21 entry of the Page Table is cleared, the page is not present in main memory; in this case, the paging unit issues a Page Fault exception while translating the linear address. The same exception is issued whenever the process attempts to access linear addresses outside of the interval delimited by 0x20000000 and 0x2003ffff since the Page Table entries not assigned to the process are filled with zeros; in particular, their Present flags are all cleared.
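The walk performed by the paging unit in this example can be simulated in user space. The sketch below is a model under stated assumptions rather than real hardware behavior: the structure and function names are hypothetical, Page Directory entries are represented by C pointers instead of physical addresses, and a return value of -1 stands for the Page Fault exception.

```c
#include <stdint.h>
#include <stddef.h>

#define ENTRY_PRESENT 0x001u

struct page_table {
    uint32_t entry[1024];                 /* frame address | flags */
};

struct page_dir {
    const struct page_table *table[1024]; /* NULL models Present = 0 */
};

/* Returns 0 and stores the physical address in *phys, or returns -1
 * to model a Page Fault exception. */
static int translate(const struct page_dir *pd, uint32_t linear,
                     uint32_t *phys)
{
    /* Step 1: the Directory field selects the Page Table. */
    const struct page_table *pt = pd->table[linear >> 22];
    if (pt == NULL)
        return -1;

    /* Step 2: the Table field selects the Page Table entry. */
    uint32_t pte = pt->entry[(linear >> 12) & 0x3ffu];
    if (!(pte & ENTRY_PRESENT))
        return -1;

    /* Step 3: the Offset field selects the byte in the page frame. */
    *phys = (pte & 0xfffff000u) | (linear & 0xfffu);
    return 0;
}
```

If entry 0x21 of the Page Table referenced by Directory entry 0x80 maps a page frame at, say, the hypothetical physical address 0x00345000, the linear address 0x20021406 translates to 0x00345406, while any address whose entries are zero-filled yields -1.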

2.4.5 Three-Level Paging

Two-level paging is used by 32-bit microprocessors. But in recent years, several microprocessors (such as Hewlett-Packard's Alpha, Intel's Itanium, and Sun's UltraSPARC) have adopted a 64-bit architecture. In this case, two-level paging is no longer suitable and it is necessary to move up to three-level paging. Let's use a thought experiment to explain why.

Start by assuming as large a page size as is reasonable (since you have to account for pages being transferred routinely to and from disk). Let's choose 16 KB for the page size. Since 1 KB covers a range of 2^10 addresses, 16 KB covers 2^14 addresses, so the Offset field is 14 bits. This leaves 50 bits of the linear address to be distributed between the Table and the Directory fields. If we now decide to reserve 25 bits for each of these two fields, this means that both the Page Directory and the Page Tables of each process include 2^25 entries, that is, more than 32 million entries.

Even if RAM is getting cheaper and cheaper, we cannot afford to waste so much memory space just for storing the Page Tables.

The solution chosen for the Hewlett-Packard's Alpha microprocessor, one of the first 64-bit CPUs that appeared on the market, is the following:

● Page frames are 8 KB long, so the Offset field is 13 bits long.

● Only the least significant 43 bits of an address are used. (The most significant 21 bits are always set to 0.)

● Three levels of Page Tables are introduced so that the remaining 30 bits of the address can be split into three 10-bit fields (see Figure 2-11 later in this chapter). Thus, the Page Tables include 2^10 = 1024 entries as in the two-level paging scheme examined previously.

As we shall see in Section 2.5 later in this chapter, Linux's designers decided to implement a paging model inspired by the Alpha architecture.
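The Alpha layout described above (a 13-bit Offset and three 10-bit fields in a 43-bit address) can be sketched as a field extraction routine. The structure, its member names, and the function name are hypothetical; the text does not name the three levels.

```c
#include <stdint.h>

/* The four fields of a 43-bit address under the three-level scheme. */
struct three_level_fields {
    unsigned level1;  /* bits 42-33: index into the top-level table    */
    unsigned level2;  /* bits 32-23: index into the middle-level table */
    unsigned level3;  /* bits 22-13: index into the last-level table   */
    unsigned offset;  /* bits 12-0:  position within the 8 KB frame    */
};

static struct three_level_fields split_address(uint64_t addr)
{
    struct three_level_fields f;

    f.offset = (unsigned)(addr & 0x1fffu);
    f.level3 = (unsigned)((addr >> 13) & 0x3ffu);
    f.level2 = (unsigned)((addr >> 23) & 0x3ffu);
    f.level1 = (unsigned)((addr >> 33) & 0x3ffu);
    return f;
}
```

Each table therefore holds 1,024 entries, compared with the 2^25 entries per table of the two-level thought experiment.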

2.4.6 The Physical Address Extension (PAE) Paging Mechanism

The amount of RAM supported by a processor is limited by the number of address pins connected to the address bus. Older Intel processors from the 80386 to the Pentium used 32-bit physical addresses. In theory, up to 4 GB of RAM could be installed on such systems; in practice, due to the linear address space requirements of User Mode processes, the kernel cannot directly address more than 1 GB of RAM, as we shall see later in Section 2.5.

However, some demanding applications running on large servers require more than 1 GB of RAM, and in recent years this created pressure on Intel to expand the amount of RAM supported on the 32-bit 80386 architecture.

Intel has satisfied these requests by increasing the number of address pins on its processors from 32 to 36. Starting with the Pentium Pro, all Intel processors are now able to address up to 2^36 = 64 GB of RAM. However, the increased range of physical addresses can be exploited only by introducing a new paging mechanism that translates 32-bit linear addresses into 36-bit physical ones.

With the Pentium Pro processor, Intel introduced a mechanism called Physical Address Extension (PAE). Another mechanism, Page Size Extension (PSE-36), was introduced in the Pentium III processor, but Linux does not use it and we won't discuss it further in this book.

PAE is activated by setting the Physical Address Extension (PAE) flag in the cr4 control register. The Page Size Extension (PSE) flag in the cr4 control register enables large page sizes (2 MB when PAE is enabled).

Intel has changed the paging mechanism in order to support PAE.

● The 64 GB of RAM are split into 2^24 distinct page frames, and the physical address field of Page Table entries has been expanded from 20 to 24 bits. Since a PAE Page Table entry must include the 12 flag bits (described in the earlier Section 2.4.1) and the 24 physical address bits, for a grand total of 36, the Page Table entry size has been doubled from 32 bits to 64 bits. As a result, a 4 KB PAE Page Table includes 512 entries instead of 1,024.

● A new level of Page Table called the Page Directory Pointer Table (PDPT), consisting of four 64-bit entries, has been introduced.

● The cr3 control register contains a 27-bit Page Directory Pointer Table base address field. Since PDPTs are stored in the first 4 GB of RAM and aligned to a multiple of 32 bytes (2^5), 27 bits are sufficient to represent the base address of such tables.

● When mapping linear addresses to 4 KB pages (PS flag cleared in Page Directory entry), the 32 bits of a linear address are interpreted in the following way:

    cr3          Points to a PDPT
    bits 31-30   Point to one of 4 possible entries in PDPT
    bits 29-21   Point to one of 512 possible entries in Page Directory
    bits 20-12   Point to one of 512 possible entries in Page Table
    bits 11-0    Offset of 4 KB page

● When mapping linear addresses to 2 MB pages (PS flag set in Page Directory entry), the 32 bits of a linear address are interpreted in the following way:

    cr3          Points to a PDPT
    bits 31-30   Point to one of 4 possible entries in PDPT
    bits 29-21   Point to one of 512 possible entries in Page Directory
    bits 20-0    Offset of 2 MB page

To summarize, once cr3 is set, it is possible to address up to 4 GB of RAM. If we want to address more RAM, we'll have to put a new value in cr3 or change the content of the PDPT.
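The two field layouts listed above can be made concrete with a small user-space sketch in C. It is not kernel code: the names pae_fields, pae_split_4k, pae_join_4k, and pae_offset_2m are hypothetical helpers introduced here only to illustrate how the bit ranges are extracted from a 32-bit linear address.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of a 32-bit linear address under PAE with 4 KB pages. */
struct pae_fields {
    uint32_t pdpt_index; /* bits 31-30: one of 4 PDPT entries            */
    uint32_t pd_index;   /* bits 29-21: one of 512 Page Directory entries */
    uint32_t pt_index;   /* bits 20-12: one of 512 Page Table entries     */
    uint32_t offset;     /* bits 11-0:  offset within the 4 KB page       */
};

/* Hypothetical helper: split a linear address into its four fields. */
struct pae_fields pae_split_4k(uint32_t linear)
{
    struct pae_fields f;

    f.pdpt_index = (linear >> 30) & 0x3;
    f.pd_index   = (linear >> 21) & 0x1ff;
    f.pt_index   = (linear >> 12) & 0x1ff;
    f.offset     = linear & 0xfff;
    return f;
}

/* Hypothetical helper: rebuild the linear address from its fields. */
uint32_t pae_join_4k(struct pae_fields f)
{
    assert(f.pdpt_index < 4 && f.pd_index < 512);
    assert(f.pt_index < 512 && f.offset < 4096);
    return (f.pdpt_index << 30) | (f.pd_index << 21) |
           (f.pt_index << 12) | f.offset;
}

/* With 2 MB pages the Page Table level disappears: bits 20-0 are the offset. */
uint32_t pae_offset_2m(uint32_t linear)
{
    return linear & 0x1fffff;
}
```

Note that the four fields always add up to 2 + 9 + 9 + 12 = 32 bits, which is why a single cr3 value can reach at most 4 GB of RAM.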

However, the main problem with PAE is that linear addresses are still 32 bits long. This forces programmers to reuse the same linear addresses to map different areas of RAM. We'll sketch how Linux initializes Page Tables when PAE is enabled in the later Section 2.5.5.4.

2.4.7 Hardware Cache

Today's microprocessors have clock rates of several gigahertz, while dynamic RAM (DRAM) chips have access times in the range of hundreds of clock cycles. This means that the CPU may be held back considerably while executing instructions that require fetching operands from RAM and/or storing results into RAM.

Hardware cache memories were introduced to reduce the speed mismatch between CPU and RAM. They are based on the well-known locality principle, which holds both for programs and data structures. This states that because of the cyclic structure of programs and the packing of related data into linear arrays, addresses close to the ones most recently used have a high probability of being used in the near future. It therefore makes sense to introduce a smaller and faster memory that contains the most recently used code and data. For this purpose, a new unit called the line was introduced into the 80x86 architecture. It consists of a few dozen contiguous bytes that are transferred in burst mode between the slow DRAM and the fast on-chip static RAM (SRAM) used to implement caches.

The cache is subdivided into subsets of lines. At one extreme, the cache can be direct mapped, in which case a line in main memory is always stored at the exact same location in the cache. At the other extreme, the cache is fully associative, meaning that any line in memory can be stored at any location in the cache. But most caches are to some degree N-way set associative, where any line of main memory can be stored in any one of N lines of the cache. For instance, a line of memory can be stored in two different lines of a two-way set associative cache.

As shown in Figure 2-9, the cache unit is inserted between the paging unit and the main memory. It includes both a hardware cache memory and a cache controller. The cache memory stores the actual lines of memory. The cache controller stores an array of entries, one entry for each line of the cache memory. Each entry includes a tag and a few flags that describe the status of the cache line. The tag consists of some bits that allow the cache controller to recognize the memory location currently mapped by the line. The bits of the memory physical address are usually split into three groups: the most significant ones correspond to the tag, the middle ones to the cache controller subset index, and the least significant ones to the offset within the line.

Figure 2-9. Processor hardware cache
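The three-way split of a physical address into tag, subset index, and offset can be sketched in a few lines of C. This is only an illustration under assumed parameters: the structure and function names are hypothetical, and the line size and number of subsets are inputs chosen by the caller, not values taken from any specific processor.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed cache geometry; both values are base-2 logarithms. */
struct cache_geometry {
    unsigned line_bits; /* log2 of the line size in bytes  */
    unsigned set_bits;  /* log2 of the number of subsets   */
};

/* The three groups of bits of a physical address. */
struct cache_addr {
    uint32_t tag;       /* most significant bits             */
    uint32_t set_index; /* middle bits: selects the subset   */
    uint32_t offset;    /* least significant bits: byte in line */
};

/* Hypothetical helper: split a 32-bit physical address. */
struct cache_addr cache_split(uint32_t phys, struct cache_geometry g)
{
    struct cache_addr a;

    assert(g.line_bits + g.set_bits < 32);
    a.offset    = phys & ((1u << g.line_bits) - 1);
    a.set_index = (phys >> g.line_bits) & ((1u << g.set_bits) - 1);
    a.tag       = phys >> (g.line_bits + g.set_bits);
    return a;
}
```

For example, with 32-byte lines (line_bits = 5) and 128 subsets (set_bits = 7), the low 5 bits select the byte within the line, the next 7 bits select the subset, and the remaining 20 bits form the tag that the controller compares against the stored tags.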

When accessing a RAM memory cell, the CPU extracts the subset index from the physical address and compares the tags of all lines in the subset with the high-order bits of the physical address. If a line with the same tag as the high-order bits of the address is found, the CPU has a cache hit; otherwise, it has a cache miss.

When a cache hit occurs, the cache controller behaves differently, depending on the access type. For a read operation, the controller selects the data from the cache line and transfers it into a CPU register; the RAM is not accessed and the CPU saves time, which is why the cache system was invented. For a write operation, the controller may implement one of two basic strategies called write-through and write-back. In a write-through, the controller always writes into both RAM and the cache line, effectively switching off the cache for write operations. In a write-back, which offers more immediate efficiency, only the cache line is updated and the contents of the RAM are left unchanged. After a write-back, of course, the RAM must eventually be updated. The cache controller writes the cache line back into RAM only when the CPU executes an instruction requiring a flush of cache entries or when a FLUSH hardware signal occurs (usually after a cache miss).

When a cache miss occurs, the cache line is written to memory, if necessary, and the correct line is fetched from RAM into the cache entry.

Multiprocessor systems have a separate hardware cache for every processor, and therefore need additional hardware circuitry to synchronize the cache contents. As shown in Figure 2-10, each CPU has its own local hardware cache. But now updating becomes more time consuming: whenever a CPU modifies its hardware cache, it must check whether the same data is contained in the other hardware cache; if so, it must notify the other CPU to update it with the proper value. This activity is often called cache snooping. Luckily, all this is done at the hardware level and is of no concern to the kernel.

Figure 2-10. The caches in a dual processor

Cache technology is rapidly evolving. For example, the first Pentium models included a single on-chip cache called the L1-cache. More recent models also include another larger, slower on-chip cache called the L2-cache. The consistency between the two cache levels is implemented at the hardware level. Linux ignores these hardware details and assumes there is a single cache.

The CD flag of the cr0 processor register is used to enable or disable the cache circuitry. The NW flag, in the same register, specifies whether the write-through or the write-back strategy is used for the caches.

Another interesting feature of the Pentium cache is that it lets an operating system associate a different cache management policy with each page frame. For this purpose, each Page Directory and each Page Table entry includes two flags: PCD (Page Cache Disable), which specifies whether the cache must be enabled or disabled while accessing data included in the page frame; and PWT (Page Write-Through), which specifies whether the write-back or the write-through strategy must be applied while writing data into the page frame. Linux clears the PCD and PWT flags of all Page Directory and Page Table entries; as a result, caching is enabled for all page frames and the write-back strategy is always adopted for writing.

2.4.8 Translation Lookaside Buffers (TLB)

Besides general-purpose hardware caches, 80x86 processors include other caches called Translation Lookaside Buffers (TLB) to speed up linear address translation. When a linear address is used for the first time, the corresponding physical address is computed through slow accesses to the Page Tables in RAM. The physical address is then stored in a TLB entry so that further references to the same linear address can be quickly translated.

In a multiprocessor system, each CPU has its own TLB, called the local TLB of the CPU. Contrary to the L1 cache, the corresponding entries of the TLB need not be synchronized because processes running on the existing CPUs may associate the same linear address with different physical ones.

When the cr3 control register of a CPU is modified, the hardware automatically invalidates all entries of the local TLB.


2.5 Paging in Linux

As we explained earlier in Section 2.4.5, Linux adopted a three-level paging model so paging is feasible on 64-bit architectures. Figure 2-11 shows the model, which defines three types of paging tables.

● Page Global Directory
● Page Middle Directory
● Page Table

The Page Global Directory includes the addresses of several Page Middle Directories, which in turn include the addresses of several Page Tables. Each Page Table entry points to a page frame. The linear address is thus split into four parts. Figure 2-11 does not show the bit numbers because the size of each part depends on the computer architecture.

Figure 2-11. The Linux paging model

Linux's handling of processes relies heavily on paging. In fact, the automatic translation of linear addresses into physical ones makes the following design objectives feasible:

● Assign a different physical address space to each process, ensuring an efficient protection against addressing errors.

● Distinguish pages (groups of data) from page frames (physical addresses in main memory). This allows the same page to be stored in a page frame, then saved to disk and later reloaded in a different page frame. This is the basic ingredient of the virtual memory mechanism (see Chapter 16).

As we shall see in Chapter 8, each process has its own Page Global Directory and its own set of Page Tables. When a process switch occurs (see Section 3.3), Linux saves the cr3 control register in the descriptor of the process previously in execution and then loads cr3 with the value stored in the descriptor of the process to be executed next. Thus, when the new process resumes its execution on the CPU, the paging unit refers to the correct set of Page Tables.

What happens when this three-level paging model is applied to the Pentium, which uses only two types of Page Tables? Linux essentially eliminates the Page Middle Directory field by saying that it contains zero bits. However, the position of the Page Middle Directory in the sequence of pointers is kept so that the same code can work on 32-bit and 64-bit architectures. The kernel keeps a position for the Page Middle Directory by setting the number of entries in it to 1 and mapping this single entry into the proper entry of the Page Global Directory.

However, when Linux uses the Physical Address Extension (PAE) mechanism of the Pentium Pro and later processors, Linux's Page Global Directory corresponds to the 80x86's Page Directory Pointer Table, the Page Middle Directory to the 80x86's Page Directory, and Linux's Page Table to the 80x86's Page Table.

Mapping logical to linear addresses now becomes a mechanical task, although it is still somewhat complex. The next few sections of this chapter are a rather tedious list of functions and macros that retrieve information the kernel needs to find addresses and manage the tables; most of the functions are one or two lines long. You may want to just skim these sections now, but it is useful to know the role of these functions and macros because you'll see them often in discussions throughout this book.

2.5.1 The Linear Address Fields

The following macros simplify Page Table handling:

PAGE_SHIFT
    Specifies the length in bits of the Offset field; when applied to 80x86 processors, it yields the value 12. Since all the addresses in a page must fit in the Offset field, the size of a page on 80x86 systems is 2^12 or the familiar 4,096 bytes; the PAGE_SHIFT of 12 can thus be considered the logarithm base 2 of the total page size. This macro is used by PAGE_SIZE to return the size of the page. Finally, the PAGE_MASK macro yields the value 0xfffff000 and is used to mask all the bits of the Offset field.

PMD_SHIFT
    The total length in bits of the Offset and Table fields of a linear address; in other words, the logarithm of the size of the area a Page Middle Directory entry can map. The PMD_SIZE macro computes the size of the area mapped by a single entry of the Page Middle Directory (that is, of a Page Table). The PMD_MASK macro is used to mask all the bits of the Offset and Table fields.

    When PAE is disabled, PMD_SHIFT yields the value 22 (12 from Offset plus 10 from Table), PMD_SIZE yields 2^22 or 4 MB, and PMD_MASK yields 0xffc00000. Conversely, when PAE is enabled, PMD_SHIFT yields the value 21 (12 from Offset plus 9 from Table), PMD_SIZE yields 2^21 or 2 MB, and PMD_MASK yields 0xffe00000.

PGDIR_SHIFT
    Determines the logarithm of the size of the area a Page Global Directory entry can map. The PGDIR_SIZE macro computes the size of the area mapped by a single entry of the Page Global Directory. The PGDIR_MASK macro is used to mask all the bits of the Offset, Table, and Middle Dir fields.

    When PAE is disabled, PGDIR_SHIFT yields the value 22 (the same value yielded by PMD_SHIFT), PGDIR_SIZE yields 2^22 or 4 MB, and PGDIR_MASK yields 0xffc00000. Conversely, when PAE is enabled, PGDIR_SHIFT yields the value 30 (12 from Offset plus 9 from Table plus 9 from Middle Dir), PGDIR_SIZE yields 2^30 or 1 GB, and PGDIR_MASK yields 0xc0000000.

PTRS_PER_PTE, PTRS_PER_PMD, and PTRS_PER_PGD
    Compute the number of entries in the Page Table, Page Middle Directory, and Page Global Directory. They yield the values 1,024, 1, and 1,024, respectively, when PAE is disabled, and the values 512, 512, and 4, respectively, when PAE is enabled.
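The relationships among these shift, size, mask, and entry-count values can be checked with a short user-space sketch. It does not reproduce the kernel's macro definitions: the structure paging_layout and the helper functions below are hypothetical names, and only the three shift values for each configuration are taken from the text. Everything else is derived from them.

```c
#include <assert.h>
#include <stdint.h>

/* The three shift values that describe a paging configuration. */
struct paging_layout {
    unsigned page_shift;  /* plays the role of PAGE_SHIFT  */
    unsigned pmd_shift;   /* plays the role of PMD_SHIFT   */
    unsigned pgdir_shift; /* plays the role of PGDIR_SHIFT */
};

/* Values quoted in the text for the 80x86. */
const struct paging_layout layout_no_pae = { 12, 22, 22 };
const struct paging_layout layout_pae    = { 12, 21, 30 };

/* Size of the area covered by a field boundary at the given shift. */
uint32_t area_size(unsigned shift)
{
    assert(shift < 32);
    return (uint32_t)1 << shift;
}

/* Mask that clears all the bits below the given shift. */
uint32_t area_mask(unsigned shift)
{
    return ~(area_size(shift) - 1);
}

/* Entry counts follow from the width of each linear address field. */
unsigned ptrs_per_pte(struct paging_layout l)
{
    return 1u << (l.pmd_shift - l.page_shift);
}

unsigned ptrs_per_pmd(struct paging_layout l)
{
    return 1u << (l.pgdir_shift - l.pmd_shift);
}

unsigned ptrs_per_pgd(struct paging_layout l)
{
    return 1u << (32 - l.pgdir_shift);
}
```

Deriving the entry counts from the shifts shows why the Page Middle Directory has a single entry without PAE: PMD_SHIFT and PGDIR_SHIFT coincide, so the Middle Dir field is zero bits wide.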

2.5.2 Page Table Handling

pte_t, pmd_t, and pgd_t describe the format of, respectively, a Page Table, a Page Middle Directory, and a Page Global Directory entry. They are 32-bit data types, except for pte_t, which is a 64-bit data type when PAE is enabled and a 32-bit data type otherwise. pgprot_t is another 32-bit data type that represents the protection flags associated with a single entry.

Four type-conversion macros (__pte( ), __pmd( ), __pgd( ), and __pgprot( )) cast an unsigned integer into the required type. Four other type-conversion macros (pte_val( ), pmd_val( ), pgd_val( ), and pgprot_val( )) perform the reverse casting from one of the four previously mentioned specialized types into an unsigned integer.

The kernel also provides several macros and functions to read or modify Page Table entries:

● The pte_none( ), pmd_none( ), and pgd_none( ) macros yield the value 1 if the corresponding entry has the value 0; otherwise, they yield the value 0.

● The pte_present( ), pmd_present( ), and pgd_present( ) macros yield the value 1 if the Present flag of the corresponding entry is equal to 1, that is, if the corresponding page or Page Table is loaded in main memory.

● The pte_clear( ), pmd_clear( ), and pgd_clear( ) macros clear an entry of the corresponding Page Table, thus forbidding a process to use the linear addresses mapped by the Page Table entry.

The macros pmd_bad( ) and pgd_bad( ) are used by functions to check Page Global Directory and Page Middle Directory entries passed as input parameters. Each macro yields the value 1 if the entry points to a bad Page Table, that is, if at least one of the following conditions applies:

● The page is not in main memory (Present flag cleared).

● The page allows only Read access (Read/Write flag cleared).

● Either Accessed or Dirty is cleared (Linux always forces these flags to be set for every existing Page Table).
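The checks just listed amount to simple bit tests on an entry. The sketch below shows one way to express them in user-space C. The function names and the SK_ constants are hypothetical; the flag bit positions are those of a standard 32-bit 80x86 paging entry. The "bad" test implements only the three conditions stated above and is not claimed to match the kernel's actual macro bit for bit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Flag bits of a 32-bit 80x86 paging entry. */
#define SK_PRESENT  0x001u /* Present         */
#define SK_RW       0x002u /* Read/Write      */
#define SK_USER     0x004u /* User/Supervisor */
#define SK_ACCESSED 0x020u /* Accessed        */
#define SK_DIRTY    0x040u /* Dirty           */

/* 1 if the entry has the value 0, in the spirit of pte_none( ). */
int sk_entry_none(uint32_t e)
{
    return e == 0;
}

/* 1 if the Present flag is set, in the spirit of pte_present( ). */
int sk_entry_present(uint32_t e)
{
    return (e & SK_PRESENT) != 0;
}

/* Clears the entry, in the spirit of pte_clear( ). */
void sk_entry_clear(uint32_t *e)
{
    assert(e != NULL);
    *e = 0;
}

/*
 * 1 if at least one of the three conditions in the text applies:
 * Present cleared, Read/Write cleared, or Accessed/Dirty cleared.
 */
int sk_entry_bad(uint32_t e)
{
    uint32_t required = SK_PRESENT | SK_RW | SK_ACCESSED | SK_DIRTY;

    return (e & required) != required;
}
```

Notice that the User/Supervisor flag plays no role in the test: a directory entry may legitimately have it either set or cleared.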

No pte_bad( ) macro is defined because it is legal for a Page Table entry to refer to a page that is not present in main memory, not writable, or not accessible at all. Instead, several functions are offered to query the current value of any of the flags included in a Page Table entry:

pte_read( )
    Returns the value of the User/Supervisor flag (indicating whether the page is accessible in User Mode).

pte_write( )
    Returns 1 if the Read/Write flag is set (indicating whether the page is writable).

pte_exec( )
    Returns the value of the User/Supervisor flag (indicating whether the page is accessible in User Mode). Notice that pages on the 80x86 processor cannot be protected against code execution.

pte_dirty( )
    Returns the value of the Dirty flag (indicating whether the page has been modified).

pte_young( )
    Returns the value of the Accessed flag (indicating whether the page has been accessed).

Another group of functions sets the value of the flags in a Page Table entry:

pte_wrprotect( )
    Clears the Read/Write flag.

pte_rdprotect( ) and pte_exprotect( )
    Clear the User/Supervisor flag.

pte_mkwrite( )
    Sets the Read/Write flag.

pte_mkread( ) and pte_mkexec( )
    Set the User/Supervisor flag.

pte_mkdirty( ) and pte_mkclean( )
    Set the Dirty flag to 1 and to 0, respectively, marking the page as modified or unmodified.

pte_mkyoung( ) and pte_mkold( )
    Set the Accessed flag to 1 and to 0, respectively, marking the page as accessed (young) or nonaccessed (old).

pte_modify(p,v)
    Sets all access rights in a Page Table entry p to a specified value v.

set_pte, set_pmd, and set_pgd
    Write a specified value into a Page Table, Page Middle Directory, and Page Global Directory entry, respectively.

The ptep_set_wrprotect( ) and ptep_mkdirty( ) functions are similar to pte_wrprotect( ) and pte_mkdirty( ), respectively, except that they act on pointers to a Page Table entry. The ptep_test_and_clear_dirty( ) and ptep_test_and_clear_young( ) functions also act on pointers and are similar to pte_mkclean( ) and pte_mkold( ), respectively, except that they return the old value of the flag.

Now come the macros that combine a page address and a group of protection flags into a page entry or perform the reverse operation of extracting the page address from a Page Table entry:

mk_pte
    Accepts a linear address and a group of access rights as arguments and creates a Page Table entry.

mk_pte_phys
    Creates a Page Table entry by combining the physical address and the access rights of the page.

pte_page( )
    Returns the address of the descriptor of the page frame referenced by a Page Table entry (see Section 7.1.1).

pmd_page( )
    Returns the linear address of a Page Table from its Page Middle Directory entry.

pgd_offset(p,a)
    Receives as parameters a memory descriptor p (see Chapter 8) and a linear address a. The macro yields the address of the entry in a Page Global Directory that corresponds to the address a; the Page Global Directory is found through a pointer within the memory descriptor p. The pgd_offset_k( ) macro is similar, except that it refers to the master kernel Page Tables (see the later Section 2.5.5).

pmd_offset(p,a)
    Receives as parameters a Page Global Directory entry p and a linear address a; it yields the address of the entry corresponding to the address a in the Page Middle Directory referenced by p.

pte_offset(p,a)
    Similar to pmd_offset, but p is a Page Middle Directory entry and the macro yields the address of the entry corresponding to a in the Page Table referenced by p.
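To see how the three offset macros chain together, consider the following user-space simulation of a two-level (PAE disabled) table walk. It is a sketch under simplifying assumptions, not kernel code: all sk_ names are hypothetical, directory entries hold ordinary C pointers instead of physical addresses plus flags, and the Page Middle Directory is folded into the Page Global Directory as the text describes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SK_PAGE_SHIFT   12
#define SK_PGDIR_SHIFT  22
#define SK_PTRS_PER_PTE 1024
#define SK_PTRS_PER_PGD 1024
#define SK_PAGE_MASK    0xfffff000u
#define SK_PRESENT      0x001u

typedef struct { uint32_t val; } sk_pte;  /* frame address | flags     */
typedef struct { sk_pte *table; } sk_pgd; /* NULL means no Page Table  */

/* Stand-in for a memory descriptor holding the Page Global Directory. */
struct sk_mm { sk_pgd pgd[SK_PTRS_PER_PGD]; };

/* Entry of the Page Global Directory that covers addr. */
sk_pgd *sk_pgd_offset(struct sk_mm *mm, uint32_t addr)
{
    return &mm->pgd[addr >> SK_PGDIR_SHIFT];
}

/* Folded middle level: its single entry is the directory entry itself. */
sk_pgd *sk_pmd_offset(sk_pgd *pgd, uint32_t addr)
{
    (void)addr;
    return pgd;
}

/* Entry of the Page Table that covers addr. */
sk_pte *sk_pte_offset(sk_pgd *pmd, uint32_t addr)
{
    assert(pmd->table != NULL);
    return &pmd->table[(addr >> SK_PAGE_SHIFT) & (SK_PTRS_PER_PTE - 1)];
}

/* Full walk: returns 0 and stores the physical address, or -1 if unmapped. */
int sk_translate(struct sk_mm *mm, uint32_t linear, uint32_t *phys)
{
    sk_pgd *pgd = sk_pgd_offset(mm, linear);
    sk_pgd *pmd = sk_pmd_offset(pgd, linear);
    sk_pte *pte;

    if (pmd->table == NULL)
        return -1;
    pte = sk_pte_offset(pmd, linear);
    if (!(pte->val & SK_PRESENT))
        return -1;
    *phys = (pte->val & SK_PAGE_MASK) | (linear & ~SK_PAGE_MASK);
    return 0;
}
```

Because sk_pmd_offset simply returns its argument, the same three-step walk would work unchanged on a layout with a real middle level; only the body of that one function would differ.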

The last group of functions of this long list was introduced to simplify the creation and deletion of Page Table entries.

When two-level paging is used (and PAE is disabled), creating or deleting a Page Middle Directory entry is trivial. As we explained earlier in this section, the Page Middle Directory contains a single entry that points to the subordinate Page Table. Thus, the Page Middle Directory entry is the entry within the Page Global Directory too. When dealing with Page Tables, however, creating an entry may be more complex because the Page Table that is supposed to contain it might not exist. In such cases, it is necessary to allocate a new page frame, fill it with zeros, and add the entry.

If PAE is enabled, the kernel uses three-level paging. When the kernel creates a new Page Global Directory, it also allocates the four corresponding Page Middle Directories; these are freed only when the parent Page Global Directory is released.

As we shall see in Section 7.1, the allocations and deallocations of page frames are expensive operations. Therefore, when the kernel destroys a Page Table, it makes sense to add the corresponding page frame to a suitable memory cache. Linux 2.4.18 already includes some functions and data structures, such as pte_quicklist or pgd_quicklist, to implement such a cache; however, the code is not mature and the cache is not used yet.

Now comes the last round of functions and macros. As usual, we'll stick to the 80x86 architecture.

pgd_alloc(m)
    Allocates a new Page Global Directory by invoking the get_pgd_slow( ) function. If PAE is enabled, the latter function also allocates the four children Page Middle Directories. The argument m (the address of a memory descriptor) is ignored on the 80x86 architecture.

pmd_alloc(m,p,a)
    Defined so three-level paging systems can allocate a new Page Middle Directory for the linear address a. If PAE is not enabled, the function simply returns the input parameter p, that is, the address of the entry in the Page Global Directory. If PAE is enabled, the function returns the address of the Page Middle Directory that was allocated when the Page Global Directory was created. The argument m is ignored.

pte_alloc(m,p,a)
    Receives as parameters the address of a Page Middle Directory entry p and a linear address a, and returns the address of the Page Table entry corresponding to a. If the Page Middle Directory entry is null, the function must allocate a new Page Table. The page frame is allocated by invoking pte_alloc_one( ). If a new Page Table is allocated, the entry corresponding to a is initialized and the User/Supervisor flag is set. The argument m is ignored.

pte_free( ) and pgd_free( )
    Release a Page Table. The pmd_free( ) function does nothing, since Page Middle Directories are allocated and deallocated together with their parent Page Global Directory.

free_one_pmd( )
    Invokes pte_free( ) to release a Page Table and sets the corresponding entry in the Page Middle Directory to NULL.

free_one_pgd( )
    Releases all Page Tables of a Page Middle Directory by using free_one_pmd( ) repeatedly. Then it releases the Page Middle Directory by invoking pmd_free( ).

clear_page_tables( )
    Clears the contents of the Page Tables of a process by iteratively invoking free_one_pgd( ).

2.5.3 Reserved Page Frames

The kernel's code and data structures are stored in a group of reserved page frames. A page contained in one of these page frames can never be dynamically assigned or swapped to disk.

As a general rule, the Linux kernel is installed in RAM starting from the physical address 0x00100000, i.e., from the second megabyte. The total number of page frames required depends on how the kernel is configured. A typical configuration yields a kernel that can be loaded in less than 2 MB of RAM.

Why isn't the kernel loaded starting with the first available megabyte of RAM? Well, the PC architecture has several peculiarities that must be taken into account. For example:

● Page frame 0 is used by BIOS to store the system hardware configuration detected during the Power-On Self-Test (POST); the BIOS of many laptops, moreover, writes data on this page frame even after the system is initialized.

● Physical addresses ranging from 0x000a0000 to 0x000fffff are usually reserved to BIOS routines and to map the internal memory of ISA graphics cards. This area is the well-known hole from 640 KB to 1 MB in all IBM-compatible PCs: the physical addresses exist but they are reserved, and the corresponding page frames cannot be used by the operating system.

● Additional page frames within the first megabyte may be reserved by specific computer models. For example, the IBM ThinkPad maps the 0xa0 page frame into the 0x9f one.

In the early stage of the boot sequence (see Appendix A), the kernel queries the BIOS and learns the size of the physical memory. In recent computers, the kernel also invokes a BIOS procedure to build a list of physical address ranges and their corresponding memory types.

Later, the kernel executes the setup_memory_region( ) function, which fills a table of physical memory regions, shown in Table 2-1. Of course, the kernel builds this table on the basis of the BIOS list, if this is available; otherwise the kernel builds the table following the conservative default setup. All page frames with numbers from 0x9f (LOWMEMSIZE( )) to 0x100 (HIGH_MEMORY) are marked as reserved.
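The page frame numbers quoted above translate to physical addresses by a 12-bit shift, since each page frame is 4 KB. The following is a minimal user-space sketch of that arithmetic, not kernel code: every toy_ name is hypothetical, and the upper bound 0x100 is treated as exclusive on the assumption that frame 0x100 (physical address 0x00100000) is where the kernel itself starts.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT   12      /* 4 KB page frames */
#define TOY_LOWMEM_PFN   0x9fu   /* first frame reserved by default, per the text */
#define TOY_HIGHMEM_PFN  0x100u  /* first frame of the second megabyte */

/* Physical address of the first byte of page frame pfn. */
static uint32_t toy_pfn_to_phys(uint32_t pfn)
{
    assert(pfn < (1u << (32 - TOY_PAGE_SHIFT)));  /* must fit in 32 bits */
    return pfn << TOY_PAGE_SHIFT;
}

/* Number of the page frame that contains physical address phys. */
static uint32_t toy_phys_to_pfn(uint32_t phys)
{
    return phys >> TOY_PAGE_SHIFT;
}

/* Nonzero if the frame lies in the range reserved by the default setup
 * (assumption: the upper bound is exclusive). */
static int toy_in_reserved_hole(uint32_t pfn)
{
    return pfn >= TOY_LOWMEM_PFN && pfn < TOY_HIGHMEM_PFN;
}
```

Under these assumptions, frame 0x9f starts at physical address 0x0009f000 and frame 0x100 at 0x00100000, so the reserved range covers the last page below 640 KB plus the whole hole up to 1 MB.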

Table 2-1. Example of BIOS-provided physical addresses map

Start        End          Type
0x00000000   0x0009ffff   Usable
0x000f0000   0x000fffff   Reserved
0x00100000   0x07feffff   Usable
0x07ff0000   0x07ff2fff   ACPI data
0x07ff3000   0x07ffffff   ACPI NVS
0xffff0000   0xffffffff   Reserved

A typical configuration for a computer having 128 MB of RAM is shown in Table 2-1. The physical address range from 0x07ff0000 to 0x07ff2fff stores information about the hardware devices of the system written by the BIOS in the POST phase; during the initialization phase, the kernel copies such information in a suitable kernel data structure, and then considers these page frames usable. Conversely, the physical address range of 0x07ff3000 to 0x07ffffff is mapped on ROM chips of the hardware devices. The physical address range starting from 0xffff0000 is marked as reserved since it is mapped by the hardware to the BIOS's ROM chip (see Appendix A). Notice that the BIOS may not provide information for some physical address ranges (in the table, the range is 0x000a0000 to 0x000effff). To be on the safe side, Linux assumes that such ranges are not usable.

To avoid loading the kernel into groups of noncontiguous page frames, Linux prefers to skip the first megabyte of RAM. Clearly, page frames not reserved by the PC architecture will be used by Linux to store dynamically assigned pages.

Figure 2-12 shows how the first 2 MB of RAM are filled by Linux. We have assumed that the kernel requires less than one megabyte of RAM (this is a bit optimistic).

Figure 2-12. The first 512 page frames (2 MB) in Linux 2.4
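To make the reading of Table 2-1 concrete, the sketch below stores its rows in an array and applies the rule stated above: an address that the BIOS does not describe is treated as not usable. This is an illustrative user-space model only; the toy_ types and functions are hypothetical and do not reproduce the kernel's own data structures, and the total counts only the ranges typed "Usable" (it ignores the ACPI data frames that the kernel reclaims later).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum toy_mem_type { TOY_USABLE, TOY_RESERVED, TOY_ACPI_DATA, TOY_ACPI_NVS };

struct toy_region {
    uint32_t start;          /* first byte of the range */
    uint32_t end;            /* last byte of the range (inclusive) */
    enum toy_mem_type type;
};

/* The six rows of Table 2-1. */
static const struct toy_region toy_map[] = {
    { 0x00000000u, 0x0009ffffu, TOY_USABLE    },
    { 0x000f0000u, 0x000fffffu, TOY_RESERVED  },
    { 0x00100000u, 0x07feffffu, TOY_USABLE    },
    { 0x07ff0000u, 0x07ff2fffu, TOY_ACPI_DATA },
    { 0x07ff3000u, 0x07ffffffu, TOY_ACPI_NVS  },
    { 0xffff0000u, 0xffffffffu, TOY_RESERVED  },
};

#define TOY_MAP_LEN (sizeof(toy_map) / sizeof(toy_map[0]))

/* Nonzero only if addr lies in a range reported as usable; addresses
 * missing from the map are conservatively treated as not usable. */
static int toy_is_usable(uint32_t addr)
{
    size_t i;

    for (i = 0; i < TOY_MAP_LEN; i++)
        if (addr >= toy_map[i].start && addr <= toy_map[i].end)
            return toy_map[i].type == TOY_USABLE;
    return 0;
}

/* Total number of bytes in the ranges typed "Usable". */
static uint64_t toy_usable_bytes(void)
{
    uint64_t total = 0;
    size_t i;

    for (i = 0; i < TOY_MAP_LEN; i++) {
        assert(toy_map[i].start <= toy_map[i].end);
        if (toy_map[i].type == TOY_USABLE)
            total += (uint64_t)toy_map[i].end - toy_map[i].start + 1;
    }
    return total;
}
```

With the table's values, toy_is_usable(0x000a0000) is false because the range 0x000a0000 to 0x000effff is simply absent from the map.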

The symbol _text, which corresponds to physical address 0x00100000, denotes the address of the first byte of kernel code. The end of the kernel code is similarly identified by the symbol _etext. Kernel data is divided into two groups: initialized and uninitialized. The initialized data starts right after _etext and ends at _edata. The uninitialized data follows and ends up at _end.

The symbols appearing in the figure are not defined in Linux source code; they are produced while compiling the kernel.[3]

[3]

You can find the linear address of these symbols in the file System.map, which is created right after the kernel is compiled.

2.5.4 Process Page Tables

The linear address space of a process is divided into two parts:

● Linear addresses from 0x00000000 to 0xbfffffff can be addressed when the process is in either User or Kernel Mode.

● Linear addresses from 0xc0000000 to 0xffffffff can be addressed only when the process is in Kernel Mode.

When a process runs in User Mode, it issues linear addresses smaller than 0xc0000000; when it runs in Kernel Mode, it is executing kernel code and the linear addresses issued are greater than or equal to 0xc0000000. In some cases, however, the kernel must access the User Mode linear address space to retrieve or store data.

The PAGE_OFFSET macro yields the value 0xc0000000; this is the offset in the linear address space of a process where the kernel lives. In this book, we often refer directly to the number 0xc0000000 instead.

The content of the first entries of the Page Global Directory that map linear addresses lower than 0xc0000000 (the first 768 entries with PAE disabled) depends on the specific process. Conversely, the remaining entries should be the same for all processes and equal to the corresponding entries of the kernel master Page Global Directory (see the following section).
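The figure of 768 entries follows from the address split: with PAE disabled, the top 10 bits of a 32-bit linear address select one of 1,024 Page Global Directory entries, each spanning 4 MB. A minimal sketch of that computation follows; the toy_ names are hypothetical and the code is a user-space illustration, not the kernel's own macros.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_OFFSET   0xc0000000u  /* value of PAGE_OFFSET given in the text */
#define TOY_PGDIR_SHIFT   22           /* each entry spans 4 MB without PAE */
#define TOY_PTRS_PER_PGD  1024u

/* Index of the Page Global Directory entry that maps a linear address. */
static unsigned toy_pgd_index(uint32_t linear)
{
    unsigned idx = linear >> TOY_PGDIR_SHIFT;

    assert(idx < TOY_PTRS_PER_PGD);
    return idx;
}

/* Nonzero if the address belongs to the part reachable only in Kernel Mode. */
static int toy_is_kernel_address(uint32_t linear)
{
    return linear >= TOY_PAGE_OFFSET;
}
```

Here toy_pgd_index(0xc0000000) is 768, so entries 0 through 767 describe the per-process part of the address space and entries 768 through 1,023 the shared kernel part.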

2.5.5 Kernel Page Tables

The kernel maintains a set of Page Tables for its own use, rooted at a so-called master kernel Page Global Directory. After system initialization, this set of Page Tables is never directly used by any process or kernel thread; rather, the highest entries of the master kernel Page Global Directory are the reference model for the corresponding entries of the Page Global Directories of every regular process in the system.

We explain how the kernel ensures that changes to the master kernel Page Global Directory are propagated to the Page Global Directories that are actually used by the processes in the system in Section 8.4.5.

We now describe how the kernel initializes its own Page Tables. This is a two-phase activity. In fact, right after the kernel image is loaded into memory, the CPU is still running in real mode; thus, paging is not enabled.

In the first phase, the kernel creates a limited 8 MB address space, which is enough for it to install itself in RAM.

In the second phase, the kernel takes advantage of all of the existing RAM and sets up the paging tables properly. The next sections examine how this plan is executed.

2.5.5.1 Provisional kernel Page Tables

A provisional Page Global Directory is initialized statically during kernel compilation, while the provisional Page Tables are initialized by the startup_32( ) assembly language function defined in arch/i386/kernel/head.S. We won't bother mentioning the Page Middle Directories anymore since they are equated to Page Global Directory entries. PAE support is not enabled at this stage.

The Page Global Directory is contained in the swapper_pg_dir variable, while the two Page Tables that span the first 8 MB of RAM are contained in the pg0 and pg1 variables.

The objective of this first phase of paging is to allow these 8 MB to be easily addressed both in real mode and protected mode. Therefore, the kernel must create a mapping from both the linear addresses 0x00000000 through 0x007fffff and the linear addresses 0xc0000000 through 0xc07fffff into the physical addresses 0x00000000 through 0x007fffff. In other words, the kernel during its first phase of initialization can address the first 8 MB of RAM by either linear addresses identical to the physical ones or 8 MB worth of linear addresses, starting from 0xc0000000.

The kernel creates the desired mapping by filling all the swapper_pg_dir entries with zeroes, except for entries 0, 1, 0x300 (decimal 768), and 0x301 (decimal 769); the latter two entries span all linear addresses between 0xc0000000 and 0xc07fffff. The 0, 1, 0x300, and 0x301 entries are initialized as follows:

● The address field of entries 0 and 0x300 is set to the physical address of pg0, while the address field of entries 1 and 0x301 is set to the physical address of pg1.

● The Present, Read/Write, and User/Supervisor flags are set in all four entries.

● The Accessed, Dirty, PCD, PWT, and Page Size flags are cleared in all four entries.
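The three rules above can be sketched as a few lines of C operating on an ordinary array that stands in for swapper_pg_dir. This is a user-space illustration under stated assumptions: the toy_ names are hypothetical, the physical addresses of pg0 and pg1 are passed in by the caller rather than taken from the kernel image, and the flag values are the standard 80x86 bit positions (Present is bit 0, Read/Write bit 1, User/Supervisor bit 2).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_PTRS_PER_PGD  1024
#define TOY_PRESENT       0x001u  /* bit 0 */
#define TOY_RW            0x002u  /* bit 1 */
#define TOY_USER          0x004u  /* bit 2 */

/* Fill a toy Page Global Directory as the text describes: every entry
 * is zero except 0, 1, 0x300, and 0x301, which point to the two
 * provisional Page Tables. pg0_phys and pg1_phys must be page aligned. */
static void toy_fill_provisional_pgd(uint32_t pgd[TOY_PTRS_PER_PGD],
                                     uint32_t pg0_phys, uint32_t pg1_phys)
{
    const uint32_t flags = TOY_PRESENT | TOY_RW | TOY_USER;

    assert((pg0_phys & 0xfffu) == 0 && (pg1_phys & 0xfffu) == 0);
    memset(pgd, 0, TOY_PTRS_PER_PGD * sizeof(pgd[0]));
    pgd[0] = pgd[0x300] = pg0_phys | flags;  /* first 4 MB, both mappings */
    pgd[1] = pgd[0x301] = pg1_phys | flags;  /* second 4 MB, both mappings */
}
```

Because the same Page Table is referenced from two directory entries, the first 8 MB of RAM become visible both at linear address 0 and at 0xc0000000 without duplicating any Page Table.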

The startup_32( ) assembly language function also enables the paging unit. This is achieved by loading the physical address of swapper_pg_dir into the cr3 control register and by setting the PG flag of the cr0 control register, as shown in the following equivalent code fragment:

    movl $swapper_pg_dir-0xc0000000,%eax
    movl %eax,%cr3         /* set the page table pointer.. */
    movl %cr0,%eax
    orl $0x80000000,%eax
    movl %eax,%cr0         /* ..and set paging (PG) bit */

2.5.5.2 Final kernel Page Table when RAM size is less than 896 MB

The final mapping provided by the kernel Page Tables must transform linear addresses starting from 0xc0000000 into physical addresses starting from 0.

The __pa macro is used to convert a linear address starting from PAGE_OFFSET to the corresponding physical address, while the __va macro does the reverse.
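Since the mapping is a plain offset, both conversions reduce to one subtraction or addition. The sketch below models them on 32-bit integers in user space; toy_pa and toy_va are hypothetical stand-ins for the two macros, and the range checks are an assumption added here to keep the arithmetic from wrapping around.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_OFFSET 0xc0000000u  /* value of PAGE_OFFSET given in the text */

/* Linear address in the kernel's fourth gigabyte -> physical address. */
static uint32_t toy_pa(uint32_t linear)
{
    assert(linear >= TOY_PAGE_OFFSET);
    return linear - TOY_PAGE_OFFSET;
}

/* Physical address in the first gigabyte -> kernel linear address. */
static uint32_t toy_va(uint32_t phys)
{
    assert(phys < 0x40000000u);  /* anything higher would overflow 32 bits */
    return phys + TOY_PAGE_OFFSET;
}
```

For example, toy_pa(0xc0100000) yields 0x00100000, the physical address at which the kernel image starts.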

The kernel master Page Global Directory is still stored in swapper_pg_dir. It is initialized by the paging_init( ) function, which does the following:

1. Invokes pagetable_init( ) to set up the Page Table entries properly

2. Writes the physical address of swapper_pg_dir in the cr3 control register

3. Invokes flush_tlb_all( ) to invalidate all TLB entries

The actions performed by pagetable_init( ) depend on both the amount of RAM present and the CPU model. Let's start with the simplest case. Our computer has less than 896 MB[4] of RAM, 32-bit physical addresses are sufficient to address all the available RAM, and there is no need to activate the PAE mechanism. (See the earlier Section 2.4.6.)

[4]

The highest 128 MB of linear addresses are left available for several kinds of mappings (see Section 2.5.6 later in this chapter and Section 7.3). The kernel address space left for mapping the RAM is thus 1 GB - 128 MB = 896 MB.

The swapper_pg_dir Page Global Directory is reinitialized by a cycle equivalent to the following:

    pgd = swapper_pg_dir + 768;
    address = 0xc0000000;
    while (address < end) {
        pe = _PAGE_PRESENT + _PAGE_RW + _PAGE_ACCESSED +
             _PAGE_DIRTY + _PAGE_PSE + _PAGE_GLOBAL + __pa(address);
        set_pgd(pgd, __pgd(pe));
        ++pgd;
        address += 0x400000;
    }

The end variable stores the linear address in the fourth gigabyte corresponding to the end of usable physical memory. We assume that the CPU is a recent 80x86 microprocessor supporting 4 MB pages and "global" TLB entries. Notice that the User/Supervisor flags in all Page Global Directory entries referencing linear addresses above 0xc0000000 are cleared, thus denying processes in User Mode access to the kernel address space.

The identity mapping of the first 8 MB of physical memory built by the startup_32( ) function is required to complete the initialization phase of the kernel. When this mapping is no longer necessary, the kernel clears the corresponding Page Table entries by invoking the zap_low_mappings( ) function.
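The reinitialization loop shown earlier in this section can be traced in user space by letting a plain array play the role of swapper_pg_dir. The sketch below is an illustration under assumptions, not the kernel routine: the toy_ names are hypothetical, the RAM size is a parameter, and the flag constants use the standard 80x86 bit positions (Present 0x001, Read/Write 0x002, Accessed 0x020, Dirty 0x040, Page Size 0x080, Global 0x100).

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_OFFSET    0xc0000000u
#define TOY_PAGE_PRESENT   0x001u
#define TOY_PAGE_RW        0x002u
#define TOY_PAGE_ACCESSED  0x020u
#define TOY_PAGE_DIRTY     0x040u
#define TOY_PAGE_PSE       0x080u  /* entry maps a 4 MB page directly */
#define TOY_PAGE_GLOBAL    0x100u

/* Map ram_bytes of RAM (a multiple of 4 MB, at most 896 MB) starting at
 * linear address 0xc0000000, one 4 MB page per directory entry.
 * Returns the number of entries written. */
static unsigned toy_map_lowmem(uint32_t pgd[1024], uint32_t ram_bytes)
{
    const uint32_t flags = TOY_PAGE_PRESENT + TOY_PAGE_RW + TOY_PAGE_ACCESSED +
                           TOY_PAGE_DIRTY + TOY_PAGE_PSE + TOY_PAGE_GLOBAL;
    uint32_t address = TOY_PAGE_OFFSET;
    uint32_t end;
    unsigned idx = 768, count = 0;

    assert(ram_bytes <= (896u << 20));
    assert((ram_bytes & 0x3fffffu) == 0);
    end = TOY_PAGE_OFFSET + ram_bytes;

    while (address < end) {
        pgd[idx++] = flags + (address - TOY_PAGE_OFFSET);  /* flags + __pa(address) */
        address += 0x400000u;
        count++;
    }
    return count;
}
```

For a machine with 128 MB of RAM the loop writes 32 entries, 768 through 799, and leaves the User/Supervisor bit (0x004) clear in each of them.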

Actually, this description does not state the whole truth. As we shall see in the later Section 2.5.6, the kernel also adjusts the entries of Page Tables corresponding to the "fix-mapped linear addresses."

2.5.5.3 Final kernel Page Table when RAM size is between 896 MB and 4096 MB

In this case, the RAM cannot be mapped entirely into the kernel linear address space. The best Linux can do during the initialization phase is to map a RAM window of 896 MB into the kernel linear address space. If a program needs to address other parts of the existing RAM, some other linear address interval must be mapped to the required RAM. This implies changing the value of some Page Table entries. We defer the discussion of how this kind of dynamic remapping is done to Chapter 7.

To initialize the Page Global Directory, the kernel uses the same code as in the previous case.

2.5.5.4 Final kernel Page Table when RAM size is more than 4096 MB

Let's now consider kernel Page Table initialization for computers with more than 4 GB; more precisely, we deal with cases in which the following happens:

● The CPU model supports Physical Address Extension (PAE).

● The amount of RAM is larger than 4 GB.

● The kernel is compiled with PAE support.

Although PAE handles 36-bit physical addresses, linear addresses are still 32-bit addresses. As in the previous case, Linux maps an 896 MB RAM window into the kernel linear address space; the remaining RAM is left unmapped and handled by dynamic remapping, as described in Chapter 7. The main difference with the previous case is that a three-level paging model is used, so the Page Global Directory is initialized as follows:

    for (i = 0; i < 3; i++)
        set_pgd(swapper_pg_dir + i, __pgd(1 + __pa(empty_zero_page)));
    pgd = swapper_pg_dir + 3;
    address = 0xc0000000;
    set_pgd(pgd, __pgd(__pa(pmd) + 0x1));
    while (address < 0xf8000000) {
        pe = _PAGE_PRESENT + _PAGE_RW + _PAGE_ACCESSED +
             _PAGE_DIRTY + _PAGE_PSE + _PAGE_GLOBAL + __pa(address);
        set_pmd(pmd, __pmd(pe));
        pmd++;
        address += 0x200000;
    }
    pgd_base[0] = pgd_base[3];

The kernel initializes the first three entries in the Page Global Directory corresponding to the user linear address space with the address of an empty page (empty_zero_page). The fourth entry is initialized with the address of a Page Middle Directory (pmd). The first 448 entries in the Page Middle Directory (there are 512 entries, but the last 64 are reserved for noncontiguous memory allocation) are filled with the physical addresses of the first 896 MB of RAM, one 2 MB page per entry; the loop therefore stops at linear address 0xc0000000 + 896 MB = 0xf8000000.

Notice that all CPU models that support PAE also support large 2 MB pages and global pages. As in the previous case, whenever possible, Linux uses large pages to reduce the number of Page Tables.
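The entry counts quoted for the PAE case follow from the three-level address split: the top 2 bits of a linear address select one of four Page Global Directory entries (1 GB each), and the next 9 bits select one of 512 Page Middle Directory entries (2 MB each). The following is a minimal sketch of that arithmetic with hypothetical toy_ names; it only computes indexes and counts and does not build any real paging structure.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAE_PGDIR_SHIFT  30    /* each of the 4 PGD entries spans 1 GB */
#define TOY_PAE_PMD_SHIFT    21    /* each PMD entry maps one 2 MB page */
#define TOY_PTRS_PER_PMD     512u

/* Index of the Page Global Directory entry for a linear address. */
static unsigned toy_pae_pgd_index(uint32_t linear)
{
    return linear >> TOY_PAE_PGDIR_SHIFT;
}

/* Index of the Page Middle Directory entry for a linear address. */
static unsigned toy_pae_pmd_index(uint32_t linear)
{
    return (linear >> TOY_PAE_PMD_SHIFT) & (TOY_PTRS_PER_PMD - 1);
}

/* Number of 2 MB entries needed to map a window of window_bytes. */
static unsigned toy_pmd_entries_for(uint32_t window_bytes)
{
    assert((window_bytes & ((1u << TOY_PAE_PMD_SHIFT) - 1)) == 0);
    return window_bytes >> TOY_PAE_PMD_SHIFT;
}
```

With these definitions, linear address 0xc0000000 falls in directory entry 3, and an 896 MB window needs 448 of the 512 middle-directory entries, which leaves the 64 entries mentioned in the text.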

2.5.6 Fix-Mapped Linear Addresses

We saw that the initial part of the fourth gigabyte of kernel linear addresses maps the physical memory of the system. However, at least 128 MB of linear addresses are always left available because the kernel uses them to implement noncontiguous memory allocation and fix-mapped linear addresses.

Noncontiguous memory allocation is just a special way to dynamically allocate and release pages of memory, and is described in Section 7.3. In this section, we focus on fix-mapped linear addresses.

Basically, a fix-mapped linear address is a constant linear address like 0xfffffdf0 whose corresponding physical address can be set up in an arbitrary way. Thus, each fix-mapped linear address maps one page frame of the physical memory.

Fix-mapped linear addresses are conceptually similar to the linear addresses that map the first 896 MB of RAM. However, a fix-mapped linear address can map any physical address, while the mapping established by the linear addresses in the initial portion of the fourth gigabyte is linear (linear address X maps physical address X - PAGE_OFFSET).

With respect to variable pointers, fix-mapped linear addresses are more efficient. In fact, dereferencing a variable pointer requires one more memory access than dereferencing an immediate constant address. Moreover, checking the value of a variable pointer before dereferencing it is a good programming practice; conversely, the check is never required for a constant linear address.

Each fix-mapped linear address is represented by an integer index defined in the enum fixed_addresses data structure:

    enum fixed_addresses {
        FIX_APIC_BASE,
        FIX_IO_APIC_BASE_0,
        [...]
        __end_of_fixed_addresses
    };

Fix-mapped linear addresses are placed at the end of the fourth gigabyte of linear addresses. The fix_to_virt( ) function computes the constant linear address starting from the index:

    inline unsigned long fix_to_virt(const unsigned int idx)
    {
        if (idx >= __end_of_fixed_addresses)
            __this_fixmap_does_not_exist( );
        return (0xffffe000UL - (idx next_task) != &init_task ; )

The macro is the loop control statement after which the kernel programmer supplies the loop. Notice how the init_task process descriptor just plays the role of list header. The macro starts by moving past init_task to the next task and continues until it reaches init_task again (thanks to the circularity of the list).

3.2.2.4 Doubly linked lists

The process list is a special doubly linked list. However, as you may have noticed, the Linux kernel uses hundreds of doubly linked lists that store the various kernel data structures. For each list, a set of primitive operations must be implemented: initializing the list, inserting and deleting an element, scanning the list, and so on. It would be both a waste of programmers' efforts and a waste of memory to replicate the primitive operations for each different list.

Therefore, the Linux kernel defines the list_head data structure, whose fields next and prev represent the forward and back pointers of a generic doubly linked list element, respectively. It is important to note, however, that the pointers in a list_head field store the addresses of other list_head fields rather than the addresses of the whole data structures in which the list_head structure is included (see Figure 3-4).

Figure 3-4. A doubly linked list built with list_head data structures

A new list is created by using the LIST_HEAD(list_name) macro. It declares a new variable named list_name of type list_head, which is the conventional first element of the new list (much as init_task is the conventional first element of the process list).

Several functions and macros implement the primitives, including those shown in the following list.

list_add(n,p) Inserts an element pointed to by n right after the specified element pointed to by p (to insert n at the beginning of the list, set p to the address of the conventional first element)

list_add_tail(n,h) Inserts an element pointed to by n at the end of the list specified by the address h of its conventional first element

list_del(p) Deletes an element pointed to by p (there is no need to specify the conventional first element of the list)

list_empty(p) Checks if the list specified by the address of its conventional first element is empty

list_entry(p,t,f) Returns the address of the data structure of type t in which the list_head field that has the name f and the address p is included

list_for_each(p,h) Scans the elements of the list specified by the address h of the conventional first element (similar to for_each_task for the process list)
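The primitives listed above can be reproduced in a few lines of portable C, which makes it easier to see why the links point at the embedded field rather than at the enclosing structure. The following is a user-space sketch written for illustration, not the kernel's own header: all toy_ names are hypothetical, and toy_list_entry recovers the enclosing structure with the standard offsetof macro, which is one common way to express the same idea.

```c
#include <assert.h>
#include <stddef.h>

struct toy_list_head {
    struct toy_list_head *next, *prev;
};

/* An empty list is a head whose two pointers refer to itself. */
static void toy_list_init(struct toy_list_head *h)
{
    h->next = h;
    h->prev = h;
}

/* Link n between two consecutive elements prev and next. */
static void toy_link(struct toy_list_head *n,
                     struct toy_list_head *prev, struct toy_list_head *next)
{
    next->prev = n;
    n->next = next;
    n->prev = prev;
    prev->next = n;
}

static void toy_list_add(struct toy_list_head *n, struct toy_list_head *p)
{
    toy_link(n, p, p->next);        /* right after p */
}

static void toy_list_add_tail(struct toy_list_head *n, struct toy_list_head *h)
{
    toy_link(n, h->prev, h);        /* right before the head, i.e., last */
}

static void toy_list_del(struct toy_list_head *p)
{
    p->next->prev = p->prev;
    p->prev->next = p->next;
}

static int toy_list_empty(const struct toy_list_head *h)
{
    return h->next == h;
}

/* Address of the structure of the given type that embeds field at ptr. */
#define toy_list_entry(ptr, type, field) \
    ((type *)((char *)(ptr) - offsetof(type, field)))

/* Example payload: the link is an ordinary field, not the first one. */
struct toy_item {
    int value;
    struct toy_list_head link;
};

/* Scan the list and add up the values of the enclosing items. */
static int toy_sum(struct toy_list_head *h)
{
    struct toy_list_head *p;
    int sum = 0;

    assert(h != NULL);
    for (p = h->next; p != h; p = p->next)
        sum += toy_list_entry(p, struct toy_item, link)->value;
    return sum;
}
```

Because the list code only ever touches the embedded field, the same six operations serve any structure that contains such a field, which is the saving in programmer effort and memory that the text describes.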

3.2.2.5 The list of TASK_RUNNING processes

When looking for a new process to run on the CPU, the kernel has to consider only the runnable processes (that is, the processes in the TASK_RUNNING state). Since it is rather inefficient to scan the whole process list, a doubly linked circular list of TASK_RUNNING processes called runqueue has been introduced. This list is implemented through the run_list field of type list_head in the process descriptor. As in the previous case, the init_task process descriptor plays the role of list header. The nr_running variable stores the total number of runnable processes.

The add_to_runqueue( ) function inserts a process descriptor at the beginning of the list, while del_from_runqueue( ) removes a process descriptor from the list. For scheduling purposes, two functions, move_first_runqueue( ) and move_last_runqueue( ), are provided to move a process descriptor to the beginning or the end of the runqueue, respectively. The task_on_runqueue( ) function checks whether a given process is inserted into the runqueue.

Finally, the wake_up_process( ) function is used to make a process runnable. It sets the process state to TASK_RUNNING and invokes add_to_runqueue( ) to insert the process in the runqueue list. It also forces the invocation of the scheduler when the process has a dynamic priority larger than that of the current process or, in SMP systems, that of a process currently executing on some other CPU (see Chapter 11).

3.2.2.6 The pidhash table and chained lists

In several circumstances, the kernel must be able to derive the process descriptor pointer corresponding to a PID. This occurs, for instance, in servicing the kill( ) system call. When process P1 wishes to send a signal to another process, P2, it invokes the kill( ) system call specifying the PID of P2 as the parameter. The kernel derives the process descriptor pointer from the PID and then extracts the pointer to the data structure that records the pending signals from P2's process descriptor.

Scanning the process list sequentially and checking the pid fields of the process descriptors is feasible but rather inefficient. To speed up the search, a pidhash hash table consisting of PIDHASH_SZ elements has been introduced (PIDHASH_SZ is usually set to 1,024). The table entries contain process descriptor pointers. The PID is transformed into a table index using the pid_hashfn macro:

    #define pid_hashfn(x) ((((x) >> 8) ^ (x)) & (PIDHASH_SZ - 1))

As every basic computer science course explains, a hash function does not always ensure a one-to-one correspondence between PIDs and table indexes. Two different PIDs that hash into the same table index are said to be colliding.

Linux uses chaining to handle colliding PIDs; each table entry is a doubly linked list of colliding process descriptors. These lists are implemented by means of the pidhash_next and pidhash_pprev fields in the process descriptor. Figure 3-5 illustrates a pidhash table with two lists. The processes having PIDs 199 and 26,799 hash into the 200th element of the table, while the process having PID 26,800 hashes into the 217th element of the table.

Figure 3-5. The pidhash table and chained lists

Hashing with chaining is preferable to a linear transformation from PIDs to table indexes because at any given instant, the number of processes in the system is usually far below 32,767 (the maximum allowed PID). It is a waste of storage to define a table consisting of 32,768 entries, if, at any given instant, most such entries are unused.

The hash_pid( ) and unhash_pid( ) functions are invoked to insert and remove a process in the pidhash table, respectively. The find_task_by_pid( ) function searches the hash table and returns the process descriptor pointer of the process with a given PID (or a null pointer if it does not find the process).
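The hash function and the chained lists can be tried out in user space. The sketch below reuses the pid_hashfn formula from the text with a table of 1,024 slots; everything else is an assumption made for illustration: the toy_ names are hypothetical, the descriptor is reduced to the three fields involved, and the insert, remove, and lookup routines show one straightforward way to maintain a chain with a next pointer and a pointer-to-previous-pointer, not necessarily the kernel's exact code.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PIDHASH_SZ 1024
#define toy_pid_hashfn(x) ((((x) >> 8) ^ (x)) & (TOY_PIDHASH_SZ - 1))

struct toy_task {
    int pid;
    struct toy_task *pidhash_next;    /* next descriptor in the same chain */
    struct toy_task **pidhash_pprev;  /* address of the pointer that refers to us */
};

static struct toy_task *toy_pidhash[TOY_PIDHASH_SZ];  /* all chains start empty */

/* Insert p at the front of the chain selected by its PID. */
static void toy_hash_pid(struct toy_task *p)
{
    struct toy_task **slot;

    assert(p != NULL && p->pid >= 0);
    slot = &toy_pidhash[toy_pid_hashfn(p->pid)];
    p->pidhash_next = *slot;
    if (p->pidhash_next != NULL)
        p->pidhash_next->pidhash_pprev = &p->pidhash_next;
    *slot = p;
    p->pidhash_pprev = slot;
}

/* Remove p from its chain; no table lookup is needed. */
static void toy_unhash_pid(struct toy_task *p)
{
    if (p->pidhash_next != NULL)
        p->pidhash_next->pidhash_pprev = p->pidhash_pprev;
    *p->pidhash_pprev = p->pidhash_next;
}

/* Return the descriptor with the given PID, or NULL if there is none. */
static struct toy_task *toy_find_task_by_pid(int pid)
{
    struct toy_task *p;

    for (p = toy_pidhash[toy_pid_hashfn(pid)]; p != NULL; p = p->pidhash_next)
        if (p->pid == pid)
            return p;
    return NULL;
}
```

Evaluating the macro for the PIDs of Figure 3-5 gives index 199 for both 199 and 26,799 (the 200th element, counting from one) and index 216 for 26,800 (the 217th element), in agreement with the figure.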

3.2.3 Parenthood Relationships Among Processes

Processes created by a program have a parent/child relationship. When a process creates multiple children, these children have sibling relationships. Several fields must be introduced in a process descriptor to represent these relationships. Processes 0 and 1 are created by the kernel; as we shall see later in the chapter, process 1 (init) is the ancestor of all other processes. The descriptor of a process P includes the following fields:

p_opptr (original parent)
    Points to the process descriptor of the process that created P or to the descriptor of process 1 (init) if the parent process no longer exists. Therefore, when a shell user starts a background process and exits the shell, the background process becomes the child of init.

p_pptr (parent)
    Points to the current parent of P (this is the process that must be signaled when the child process terminates); its value usually coincides with that of p_opptr. It may occasionally differ, such as when another process issues a ptrace( ) system call requesting that it be allowed to monitor P (see Section 20.1.5).

p_cptr (child)
    Points to the process descriptor of the youngest child of P, that is, of the process created most recently by it.

p_ysptr (younger sibling)
    Points to the process descriptor of the process that has been created immediately after P by P's current parent.

p_osptr (older sibling)
    Points to the process descriptor of the process that has been created immediately before P by P's current parent.

Figure 3-6 illustrates the parent and sibling relationships of a group of processes. Process P0 successively created P1, P2, and P3. Process P3, in turn, created process P4. Starting with p_cptr and using the p_osptr pointers to siblings, P0 is able to retrieve all its children.

Figure 3-6. Parenthood relationships among five processes
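The traversal described for Figure 3-6 can be sketched in user-space C. The field names follow the text; struct task is a hypothetical, reduced descriptor, and add_child( ) and list_children( ) are helper names invented for this illustration rather than kernel functions.

```c
#include <stddef.h>

/* Reduced stand-in for the process descriptor. */
struct task {
    int pid;
    struct task *p_pptr;   /* current parent */
    struct task *p_cptr;   /* youngest child */
    struct task *p_ysptr;  /* younger sibling */
    struct task *p_osptr;  /* older sibling */
};

/* Hypothetical helper: record child as the most recently created child. */
void add_child(struct task *parent, struct task *child)
{
    child->p_pptr = parent;
    child->p_ysptr = NULL;
    child->p_osptr = parent->p_cptr;   /* the previous youngest is now older */
    if (parent->p_cptr != NULL)
        parent->p_cptr->p_ysptr = child;
    parent->p_cptr = child;
}

/* Store up to max child PIDs, youngest first; return how many were stored. */
size_t list_children(const struct task *parent, int *pids, size_t max)
{
    size_t n = 0;

    for (const struct task *c = parent->p_cptr; c != NULL && n < max;
         c = c->p_osptr)
        pids[n++] = c->pid;
    return n;
}
```

With the processes of Figure 3-6, listing the children of P0 visits P3, P2, and P1, in that order, because p_cptr points to the youngest child and each p_osptr step moves to an older sibling.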

3.2.4 How Processes Are Organized

The runqueue list groups all processes in a TASK_RUNNING state. When it comes to grouping processes in other states, the various states call for different types of treatment, with Linux opting for one of the choices shown in the following list.

● Processes in a TASK_STOPPED or in a TASK_ZOMBIE state are not linked in specific lists. There is no need to group processes in either of these two states, since stopped and zombie processes are accessed only via PID or via linked lists of the child processes for a particular parent.

● Processes in a TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE state are subdivided into many classes, each of which corresponds to a specific event. In this case, the process state does not provide enough information to retrieve the process quickly, so it is necessary to introduce additional lists of processes. These are called wait queues.

3.2.4.1 Wait queues

Wait queues have several uses in the kernel, particularly for interrupt handling, process synchronization, and timing. Because these topics are discussed in later chapters, we'll just say here that a process must often wait for some event to occur, such as for a disk operation to terminate, a system resource to be released, or a fixed interval of time to elapse. Wait queues implement conditional waits on events: a process wishing to wait for a specific event places itself in the proper wait queue and relinquishes control. Therefore, a wait queue represents a set of sleeping processes, which are woken up by the kernel when some condition becomes true.

Wait queues are implemented as doubly linked lists whose elements include pointers to process descriptors. Each wait queue is identified by a wait queue head, a data structure of type wait_queue_head_t:

    struct __wait_queue_head {
        spinlock_t lock;
        struct list_head task_list;
    };
    typedef struct __wait_queue_head wait_queue_head_t;

Since wait queues are modified by interrupt handlers as well as by major kernel functions, the doubly linked lists must be protected from concurrent accesses, which could induce unpredictable results (see Chapter 5). Synchronization is achieved by the lock spin lock in the wait queue head. Elements of a wait queue list are of type wait_queue_t:

    struct __wait_queue {
        unsigned int flags;
        struct task_struct * task;
        struct list_head task_list;
    };
    typedef struct __wait_queue wait_queue_t;

Each element in the wait queue list represents a sleeping process, which is waiting for some event to occur; its descriptor address is stored in the task field. However, it is not always convenient to wake up all sleeping processes in a wait queue. For instance, if two or more processes are waiting for exclusive access to some resource to be released, it makes sense to wake up just one process in the wait queue. This process takes the resource, while the other processes continue to sleep. (This avoids a problem known as the "thundering herd," in which multiple processes are awoken only to race for a resource that can be accessed by one of them, and the result is that the remaining processes must once more be put back to sleep.)

Thus, there are two kinds of sleeping processes: exclusive processes (denoted by the value 1 in the flags field of the corresponding wait queue element) are selectively woken up by the kernel, while nonexclusive processes (denoted by the value 0 in flags) are always woken up by the kernel when the event occurs. A process waiting for a resource that can be granted to just one process at a time is a typical exclusive process. Processes waiting for an event like the termination of a disk operation are nonexclusive.

3.2.4.2 Handling wait queues

The add_wait_queue( ) function inserts a nonexclusive process in the first position of a wait queue list. The add_wait_queue_exclusive( ) function inserts an exclusive process in the last position of a wait queue list. The remove_wait_queue( ) function removes a process from a wait queue list. The waitqueue_active( ) function checks whether a given wait queue list is empty.

A new wait queue may be defined by using the DECLARE_WAIT_QUEUE_HEAD(name) macro, which statically declares and initializes a new wait queue head variable called name. The init_waitqueue_head( ) function may be used to initialize a wait queue head variable that was allocated dynamically.

A process wishing to wait for a specific condition can invoke any of the functions shown in the following list.

● The sleep_on( ) function operates on the current process:

    void sleep_on(wait_queue_head_t *q)
    {
        unsigned long flags;
        wait_queue_t wait;

        wait.flags = 0;
        wait.task = current;
        current->state = TASK_UNINTERRUPTIBLE;
        add_wait_queue(q, &wait);
        schedule();
        remove_wait_queue(q, &wait);
    }

  The function sets the state of the current process to TASK_UNINTERRUPTIBLE and inserts it into the specified wait queue. Then it invokes the scheduler, which resumes the execution of another process. When the sleeping process is woken, the scheduler resumes execution of the sleep_on( ) function, which removes the process from the wait queue.

● The interruptible_sleep_on( ) function is identical to sleep_on( ), except that it sets the state of the current process to TASK_INTERRUPTIBLE instead of TASK_UNINTERRUPTIBLE, so that the process can also be woken up by receiving a signal.



● The sleep_on_timeout( ) and interruptible_sleep_on_timeout( ) functions are similar to the previous ones, but they also allow the caller to define a time interval after which the process will be woken up by the kernel. To do this, they invoke the schedule_timeout( ) function instead of schedule( ) (see Section 6.6.2).



● The wait_event and wait_event_interruptible macros, introduced in Linux 2.4, put the calling process to sleep on a wait queue until a given condition is verified. For instance, the wait_event_interruptible(wq,condition) macro essentially yields the following fragment (we have omitted the code related to signal handling and return values on purpose):

    if (!(condition)) {
        wait_queue_t __wait;

        init_waitqueue_entry(&__wait, current);
        add_wait_queue(&wq, &__wait);
        for (;;) {
            set_current_state(TASK_INTERRUPTIBLE);
            if (condition)
                break;
            schedule();
        }
        current->state = TASK_RUNNING;
        remove_wait_queue(&wq, &__wait);
    }

  These macros should be used instead of the older sleep_on( ) and interruptible_sleep_on( ), because the latter functions cannot test a condition and atomically put the process to sleep when the condition is not verified, and are thus a well-known source of race conditions.

Notice that any process put to sleep by one of the above functions or macros is nonexclusive. Whenever the kernel wants to insert an exclusive process into a wait queue, it invokes add_wait_queue_exclusive( ) directly.

Processes inserted in a wait queue enter the TASK_RUNNING state by means of one of the following macros: wake_up, wake_up_nr, wake_up_all, wake_up_sync, wake_up_sync_nr, wake_up_interruptible, wake_up_interruptible_nr, wake_up_interruptible_all, wake_up_interruptible_sync, and wake_up_interruptible_sync_nr. We can understand what each of these ten macros does from its name:

● All macros take into consideration sleeping processes in TASK_INTERRUPTIBLE state; if the macro name does not include the string "interruptible," sleeping processes in TASK_UNINTERRUPTIBLE state are also considered.

● All macros wake all nonexclusive processes having the required state (see the previous bullet item). The macros whose names include the string "nr" wake a given number of exclusive processes having the required state; this number is a parameter of the macro. The macros whose names include the string "all" wake all exclusive processes having the required state. Finally, the macros whose names don't include "nr" or "all" wake exactly one exclusive process that has the required state.

● The macros whose names don't include the string "sync" check whether the priority of the woken processes is higher than that of the processes currently running in the system and invoke schedule( ) if necessary. These checks are not made by the macros whose names include the string "sync."

For instance, the wake_up macro is equivalent to the following code fragment:

    void wake_up(wait_queue_head_t *q)
    {
        struct list_head *tmp;
        wait_queue_t *curr;

        list_for_each(tmp, &q->task_list) {
            curr = list_entry(tmp, wait_queue_t, task_list);
            wake_up_process(curr->task);
            if (curr->flags)
                break;
        }
    }

The list_for_each macro scans all items in the doubly linked list of q. For each item, the list_entry macro computes the address of the corresponding wait_queue_t variable. The task field of this variable stores the pointer to the process descriptor, which is then passed to the wake_up_process( ) function. If the woken process is exclusive, the loop terminates. Since all nonexclusive processes are always at the beginning of the doubly linked list and all exclusive processes are at the end, the function always wakes the nonexclusive processes and then wakes one exclusive process, if any exists. [3] Notice that awoken processes are not removed from the wait queue. A process could be awoken while the wait condition is still false; in this case, the process may suspend itself again in the same wait queue.

[3] By the way, it is rather uncommon that a wait queue includes both exclusive and nonexclusive processes.
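The ordering rule (nonexclusive sleepers at the front, exclusive sleepers at the back) and the early exit of the wake_up loop can be tried out in a small user-space model. This is a sketch under stated assumptions: struct waiter and struct wait_queue are hypothetical types, a singly linked list replaces the kernel's doubly linked list_head for brevity, an awake flag stands in for the TASK_RUNNING state, and no spin lock is taken.

```c
#include <stddef.h>

struct waiter {
    unsigned int flags;    /* 1 = exclusive, 0 = nonexclusive */
    int awake;             /* stands in for the TASK_RUNNING state */
    struct waiter *next;
};

struct wait_queue {
    struct waiter *head;
    struct waiter *tail;
};

/* Nonexclusive sleepers are inserted at the front of the list. */
void add_nonexclusive(struct wait_queue *q, struct waiter *w)
{
    w->flags = 0;
    w->awake = 0;
    w->next = q->head;
    q->head = w;
    if (q->tail == NULL)
        q->tail = w;
}

/* Exclusive sleepers are inserted at the back of the list. */
void add_exclusive(struct wait_queue *q, struct waiter *w)
{
    w->flags = 1;
    w->awake = 0;
    w->next = NULL;
    if (q->tail != NULL)
        q->tail->next = w;
    else
        q->head = w;
    q->tail = w;
}

/* Same loop shape as the wake_up fragment: wake each element in turn and
 * stop after the first exclusive one. Returns the number of elements woken. */
int wake_up_model(struct wait_queue *q)
{
    int woken = 0;

    for (struct waiter *w = q->head; w != NULL; w = w->next) {
        w->awake = 1;
        woken++;
        if (w->flags)
            break;
    }
    return woken;
}
```

With two nonexclusive and two exclusive sleepers queued in any order, one call wakes both nonexclusive sleepers and only the exclusive sleeper that was queued first. As in the fragment above, woken elements stay on the list.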

3.2.5 Process Resource Limits

Each process has an associated set of resource limits, which specify the amount of system resources it can use. These limits keep a user from overwhelming the system (its CPU, disk space, and so on). Linux recognizes the following resource limits:

RLIMIT_AS
    The maximum size of process address space, in bytes. The kernel checks this value when the process uses malloc( ) or a related function to enlarge its address space (see Section 8.1).

RLIMIT_CORE
    The maximum core dump file size, in bytes. The kernel checks this value when a process is aborted, before creating a core file in the current directory of the process (see Section 10.1.1). If the limit is 0, the kernel won't create the file.

RLIMIT_CPU
    The maximum CPU time for the process, in seconds. If the process exceeds the limit, the kernel sends it a SIGXCPU signal, and then, if the process doesn't terminate, a SIGKILL signal (see Chapter 10).

RLIMIT_DATA
    The maximum heap size, in bytes. The kernel checks this value before expanding the heap of the process (see Section 8.6).

RLIMIT_FSIZE
    The maximum file size allowed, in bytes. If the process tries to enlarge a file to a size greater than this value, the kernel sends it a SIGXFSZ signal.

RLIMIT_LOCKS
    The maximum number of file locks. The kernel checks this value when the process enforces a lock on a file (see Section 12.7).

RLIMIT_MEMLOCK
    The maximum size of nonswappable memory, in bytes. The kernel checks this value when the process tries to lock a page frame in memory using the mlock( ) or mlockall( ) system calls (see Section 8.3.4).

RLIMIT_NOFILE
    The maximum number of open file descriptors. The kernel checks this value when opening a new file or duplicating a file descriptor (see Chapter 12).

RLIMIT_NPROC
    The maximum number of processes that the user can own (see Section 3.4.1 later in this chapter).

RLIMIT_RSS
    The maximum number of page frames owned by the process. The kernel checks this value when the process uses malloc( ) or a related function to enlarge its address space (see Section 8.1).

RLIMIT_STACK
    The maximum stack size, in bytes. The kernel checks this value before expanding the User Mode stack of the process (see Section 8.4).

The resource limits are stored in the rlim field of the process descriptor. The field is an array of elements of type struct rlimit, one for each resource limit:

    struct rlimit {
        unsigned long rlim_cur;
        unsigned long rlim_max;
    };

The rlim_cur field is the current resource limit for the resource. For example, current->rlim[RLIMIT_CPU].rlim_cur represents the current limit on the CPU time of the running process.

The rlim_max field is the maximum allowed value for the resource limit. By using the getrlimit( ) and setrlimit( ) system calls, a user can always increase the rlim_cur limit of some resource up to rlim_max. However, only the superuser (or, more precisely, a user who has the CAP_SYS_RESOURCE capability) can increase the rlim_max field or set the rlim_cur field to a value greater than the corresponding rlim_max field.

Most resource limits contain the value RLIM_INFINITY (0xffffffff), which means that no user limit is imposed on the corresponding resource (of course, real limits exist due to kernel design restrictions, available RAM, available space on disk, etc.). However, the system administrator may choose to impose stronger limits on some resources. Whenever a user logs into the system, the kernel creates a process owned by the superuser, which can invoke setrlimit( ) to decrease the rlim_max and rlim_cur fields for a resource. The same process later executes a login shell and becomes owned by the user. Each new process created by the user inherits the content of the rlim array from its parent, and therefore the user cannot override the limits enforced by the system.
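From User Mode, the two system calls are reached through the standard C library wrappers declared in <sys/resource.h>. The following sketch adjusts the soft limit (rlim_cur) on core file size and leaves the hard limit (rlim_max) untouched. The function names set_core_soft_limit( ) and core_soft_limit( ) are hypothetical helpers, and the user-space structure uses the rlim_t type rather than the unsigned long shown in the kernel definition above.

```c
#include <sys/resource.h>

/*
 * Set the soft limit on core file size to new_cur bytes.
 * Returns 0 on success and -1 on failure, as setrlimit( ) does.
 */
int set_core_soft_limit(rlim_t new_cur)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CORE, &rl) != 0)
        return -1;
    if (new_cur > rl.rlim_max)      /* the soft limit cannot exceed the hard limit */
        return -1;
    rl.rlim_cur = new_cur;          /* rlim_max is left unchanged */
    return setrlimit(RLIMIT_CORE, &rl);
}

/* Read back the current soft limit; returns 0 on success. */
int core_soft_limit(rlim_t *out)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CORE, &rl) != 0)
        return -1;
    *out = rl.rlim_cur;
    return 0;
}
```

Calling set_core_soft_limit(0) disables core dumps for the calling process and for the children it creates afterward, since, as described above, the limits are inherited.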


3.3 Process Switch

To control the execution of processes, the kernel must be able to suspend the execution of the process running on the CPU and resume the execution of some other process previously suspended. This activity goes variously by the names process switch, task switch, or context switch. The next sections describe the elements of process switching in Linux.

3.3.1 Hardware Context

While each process can have its own address space, all processes have to share the CPU registers. So before resuming the execution of a process, the kernel must ensure that each such register is loaded with the value it had when the process was suspended.

The set of data that must be loaded into the registers before the process resumes its execution on the CPU is called the hardware context. The hardware context is a subset of the process execution context, which includes all information needed for the process execution. In Linux, a part of the hardware context of a process is stored in the process descriptor, while the remaining part is saved in the Kernel Mode stack.

In the description that follows, we will assume the prev local variable refers to the process descriptor of the process being switched out and next refers to the one being switched in to replace it. We can thus define a process switch as the activity consisting of saving the hardware context of prev and replacing it with the hardware context of next. Since process switches occur quite often, it is important to minimize the time spent in saving and loading hardware contexts.

Old versions of Linux took advantage of the hardware support offered by the Intel architecture and performed a process switch through a far jmp instruction [4] to the selector of the Task State Segment Descriptor of the next process. While executing the instruction, the CPU performs a hardware context switch by automatically saving the old hardware context and loading a new one. But Linux 2.4 uses software to perform a process switch for the following reasons:

[4] far jmp instructions modify both the cs and eip registers, while simple jmp instructions modify only eip.

● Step-by-step switching performed through a sequence of mov instructions allows better control over the validity of the data being loaded. In particular, it is possible to check the values of segmentation registers. This type of checking is not possible when using a single far jmp instruction.

● The amount of time required by the old approach and the new approach is about the same. However, it is not possible to optimize a hardware context switch, while there might be room for improving the current switching code.

Process switching occurs only in Kernel Mode. The contents of all registers used by a process in User Mode have already been saved before performing process switching (see Chapter 4). This includes the contents of the ss and esp pair that specifies the User Mode stack pointer address.

3.3.2 Task State Segment

The 80x86 architecture includes a specific segment type called the Task State Segment (TSS), to store hardware contexts. Although Linux doesn't use hardware context switches, it is nonetheless forced to set up a TSS for each distinct CPU in the system. This is done for two main reasons:

● When an 80x86 CPU switches from User Mode to Kernel Mode, it fetches the address of the Kernel Mode stack from the TSS (see Chapter 4).

● When a User Mode process attempts to access an I/O port by means of an in or out instruction, the CPU may need to access an I/O Permission Bitmap stored in the TSS to verify whether the process is allowed to address the port.

  More precisely, when a process executes an in or out I/O instruction in User Mode, the control unit performs the following operations:

  1. It checks the 2-bit IOPL field in the eflags register. If it is set to 3, the control unit executes the I/O instruction. Otherwise, it performs the next check.

  2. It accesses the tr register to determine the current TSS, and thus the proper I/O Permission Bitmap.

  3. It checks the bit of the I/O Permission Bitmap corresponding to the I/O port specified in the I/O instruction. If it is cleared, the instruction is executed; otherwise, the control unit raises a "General protection error" exception.
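The three checks above can be summarized as a small predicate. This is a simplified user-space model, not kernel or hardware code: io_access_allowed( ) is a hypothetical name, the bitmap is assumed to cover every port passed in (one bit per port, 8,192 bytes for all 65,536 ports), only single-byte accesses are modeled, and the lookup of the TSS through the tr register is represented simply by passing the bitmap as an argument.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Model of the User Mode check for an in or out instruction.
 *   iopl   : value of the 2-bit IOPL field of eflags (0..3)
 *   bitmap : I/O Permission Bitmap of the current TSS; a cleared bit
 *            means that access to the corresponding port is allowed
 *   port   : I/O port number specified by the instruction
 * Returns true if the instruction would be executed, false if the
 * control unit would raise a "General protection" exception.
 */
bool io_access_allowed(unsigned int iopl, const uint8_t *bitmap, uint16_t port)
{
    if (iopl == 3)
        return true;                                   /* step 1 */

    /* Steps 2 and 3: test the port's bit in the current bitmap. */
    return (bitmap[port / 8] & (1u << (port % 8))) == 0;
}
```

For example, with IOPL set to 0 and only the bit for port 0x3f8 set in the bitmap, an access to port 0x3f8 is refused while an access to port 0x3f9 is allowed; with IOPL set to 3, the bitmap is not consulted at all.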

The tss_struct structure describes the format of the TSS. As already mentioned in Chapter 2, the init_tss array stores one TSS for each different CPU on the system. At each process switch, the kernel updates some fields of the TSS so that the corresponding CPU's control unit may safely retrieve the information it needs.

Each TSS has its own 8-byte Task State Segment Descriptor (TSSD). This Descriptor includes a 32-bit Base field that points to the TSS starting address and a 20-bit Limit field. The S flag of a TSSD is cleared to denote the fact that the corresponding TSS is a System Segment.

The Type field is set to either 9 or 11 to denote that the segment is actually a TSS. In Intel's original design, each process in the system should refer to its own TSS; the second least significant bit of the Type field is called the Busy bit; it is set to 1 if the process is being executed by a CPU, and to 0 otherwise. In the Linux design, there is just one TSS for each CPU, so the Busy bit is always set to 1.

The TSSDs created by Linux are stored in the Global Descriptor Table (GDT), whose base address is stored in the gdtr register of each CPU. The tr register of each CPU contains the TSSD Selector of the corresponding TSS. The register also includes two hidden, nonprogrammable fields: the Base and Limit fields of the TSSD. In this way, the processor can address the TSS directly without having to retrieve the TSS address from the GDT.
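The relationship between the two Type values and the Busy bit is easy to see in binary: 9 is 1001 and 11 is 1011, so they differ only in the second least significant bit. The following sketch decodes a 4-bit Type value; the macro and function names are hypothetical and do not come from the kernel headers.

```c
#include <stdbool.h>

#define TSSD_TYPE_AVAILABLE 9u         /* binary 1001: TSS, Busy bit clear */
#define TSSD_TYPE_BUSY      11u        /* binary 1011: TSS, Busy bit set   */
#define TSSD_BUSY_BIT       (1u << 1)  /* second least significant bit     */

/* True if the 4-bit Type value denotes a TSS, that is, 9 or 11. */
bool tssd_type_is_tss(unsigned int type)
{
    return (type & ~TSSD_BUSY_BIT) == TSSD_TYPE_AVAILABLE;
}

/* True if the Busy bit of the Type value is set. */
bool tssd_is_busy(unsigned int type)
{
    return (type & TSSD_BUSY_BIT) != 0;
}
```

Since Linux keeps one TSS per CPU and, as stated above, its Busy bit is always set, the Type value found in such a TSSD would be 11.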

3.3.2.1 The thread field

At every process switch, the hardware context of the process being replaced must be saved somewhere. It cannot be saved on the TSS, as in the original Intel design, because we cannot make assumptions about when the process being replaced will resume execution and what CPU will execute it again.

Thus, each process descriptor includes a field called thread of type thread_struct, in which the kernel saves the hardware context whenever the process is being switched out. As we shall see later, this data structure includes fields for most of the CPU registers, such as the general-purpose registers, the floating point registers, and so on.

3.3.3 Performing the Process Switch

A process switch may occur at just one well-defined point: the schedule( ) function (discussed at length in Chapter 11). Here, we are only concerned with how the kernel performs a process switch. Essentially, every process switch consists of two steps:

1. Switching the Page Global Directory to install a new address space; we'll describe this step in Chapter 8.

2. Switching the Kernel Mode stack and the hardware context, which provides all the information needed by the kernel to execute the new process, including the CPU registers.

Again, we assume that prev points to the descriptor of the process being replaced, and next to the descriptor of the process being activated. As we shall see in Chapter 11, prev and next are local variables of the schedule( ) function.

The second step of the process switch is performed by the switch_to macro. It is one of the most hardware-dependent routines of the kernel, and it takes some effort to understand what it does.

First of all, the macro has three parameters called prev, next, and last. The actual invocation of the macro in schedule( ) is:

    switch_to(prev, next, prev);

You might easily guess the role of prev and next (they are just placeholders for the local variables prev and next), but what about the third parameter last? Well, the point is that in any process switch, three processes are involved, not just two.

Suppose the kernel decides to switch off process A and to activate process B. In the schedule( ) function, prev points to A's descriptor and next points to B's descriptor. As soon as the switch_to macro deactivates A, the execution flow of A freezes.

Later, when the kernel wants to reactivate A, it must switch off another process C (in general, this is different from B) by executing another switch_to macro with prev pointing to C and next pointing to A. When A resumes its execution flow, it finds its old Kernel Mode stack, so the prev local variable points to A's descriptor and next points to B's descriptor. The kernel, which is now executing on behalf of process A, has lost any reference to C.

The last parameter of the switch_to macro reinserts the address of C's descriptor into the prev local variable. The mechanism exploits the state of registers during function calls. The first prev parameter corresponds to a CPU register, which is loaded with the content of the prev local variable when the macro starts. When the macro ends, it writes the content of the same register in the last parameter, namely, in the prev local variable. However, the CPU register doesn't change across the process switch, so prev receives the address of C's descriptor (as we shall see in Chapter 11, the scheduler checks whether C should be readily executed on another CPU).

Here is a description of what the switch_to macro does on an 80x86 microprocessor:

1. Saves the values of prev and next in the eax and edx registers, respectively:

movl prev,%eax
movl next,%edx

The eax and edx registers correspond to the prev and next parameters of the macro.

2. Saves another copy of prev in the ebx register; ebx corresponds to the last parameter of the macro:

movl %eax,%ebx

3. Saves the contents of the esi, edi, and ebp registers in the prev Kernel Mode stack. They must be saved because the compiler assumes that they will stay unchanged until the end of switch_to:

pushl %esi
pushl %edi
pushl %ebp

4. Saves the content of esp in prev->thread.esp so that the field points to the top of the prev Kernel Mode stack:

movl %esp, 616(%eax)

The 616(%eax) operand identifies the memory cell whose address is the contents of eax plus 616.

5. Loads next->thread.esp in esp. From now on, the kernel operates on the Kernel Mode stack of next, so this instruction performs the actual process switch from prev to next. Since the address of a process descriptor is closely related to that of the Kernel Mode stack (as explained in Section 3.2.2 earlier in this chapter), changing the kernel stack means changing the current process:

movl 616(%edx), %esp

6. Saves the address labeled 1 (shown later in this section) in prev->thread.eip. When the process being replaced resumes its execution, the process executes the instruction labeled as 1:

movl $1f, 612(%eax)

7. On the Kernel Mode stack of next, the macro pushes the next->thread.eip value, which, in most cases, is the address labeled 1:

pushl 612(%edx)

8. Jumps to the __switch_to( ) C function:

jmp __switch_to

This function acts on the prev and next parameters that denote the former process and the new process. This function call is different from the average function call, though, because __switch_to( ) takes the prev and next parameters from the eax and edx registers (where we saw they were stored), not from the stack like most functions. To force the function to go to the registers for its parameters, the kernel uses the __attribute__ and regparm keywords, which are nonstandard extensions of the C language implemented by the gcc compiler. The __switch_to( ) function is declared in the include/asm-i386/system.h header file as follows:

void __switch_to(struct task_struct *prev, struct task_struct *next) __attribute__((regparm(3)));

The __switch_to( ) function completes the process switch started by the switch_to( ) macro. It includes extended inline assembly language code that makes for rather complex reading because the code refers to registers by means of special symbols:

a. Executes the code yielded by the unlazy_fpu( ) macro (see Section 3.3.4 later in this chapter) to optionally save the contents of the FPU, MMX, and XMM registers. As we shall see, there is no need to load the corresponding registers of next while performing the context switch:

unlazy_fpu(prev);

b. Loads next->thread.esp0 in the esp0 field of the TSS relative to the current CPU so that any future privilege level change from User Mode to Kernel Mode automatically forces this address into the esp register:

init_tss[smp_processor_id( )].esp0 = next->thread.esp0;

The smp_processor_id( ) macro yields the index of the executing CPU.

c. Stores the contents of the fs and gs segmentation registers in prev->thread.fs and prev->thread.gs, respectively; the corresponding assembly language instructions are:

movl %fs,620(%esi)
movl %gs,624(%esi)

The esi register points to the prev->thread structure.

d. Loads the fs and gs segment registers with the values contained in next->thread.fs and next->thread.gs, respectively. This step logically complements the actions performed in the previous step. The corresponding assembly language instructions are:

movl 12(%ebx),%fs
movl 16(%ebx),%gs

The ebx register points to the next->thread structure. The code is actually more intricate, as an exception might be raised by the CPU when it detects an invalid segment register value. The code takes this possibility into account by adopting a "fix-up" approach (see Section 9.2.6).

e. Loads six debug registers [5] with the contents of the next->thread.debugreg array. This is done only if next was using the debug registers when it was suspended (that is, field next->thread.debugreg[7] is not 0). As we shall see in Chapter 20, these registers are modified only by writing in the TSS, so there is no need to save the corresponding registers of prev:

if (next->thread.debugreg[7]){
    loaddebug(&next->thread, 0);
    loaddebug(&next->thread, 1);
    loaddebug(&next->thread, 2);
    loaddebug(&next->thread, 3);
    /* no 4 and 5 */
    loaddebug(&next->thread, 6);
    loaddebug(&next->thread, 7);
}

[5] The 80x86 debug registers allow a process to be monitored by the hardware. Up to four breakpoint areas may be defined. Whenever a monitored process issues a linear address included in one of the breakpoint areas, an exception occurs.

f. Updates the I/O bitmap in the TSS, if necessary. This must be done when either next or prev has its own customized I/O Permission Bitmap:

if (next->thread.ioperm) {
    memcpy(init_tss[smp_processor_id( )].io_bitmap, next->thread.io_bitmap, 128);
    init_tss[smp_processor_id( )].bitmap = 104;
} else if (prev->thread.ioperm)
    init_tss[smp_processor_id( )].bitmap = 0x8000;

The customized I/O Permission Bitmap of a process is stored in a buffer pointed to by the thread.io_bitmap field of the process descriptor. If next has a customized bitmap, it is copied into the io_bitmap field of the TSS. Otherwise, if next doesn't have it, the kernel checks whether prev defined such a bitmap. In this case, the bitmap must be invalidated.

g. Terminates. Like any other function, __switch_to( ) ends by means of a ret assembly language instruction, which loads the eip program counter with the return address stored on the stack. However, the __switch_to( ) function has been invoked simply by jumping into it. Therefore, the ret assembly language instruction finds on the stack the address of the instruction shown in the following item and labeled 1, which was pushed by the switch_to macro. If next was never suspended before because it is being executed for the first time, the function finds the starting address of the ret_from_fork( ) function (see Section 3.4.1 later in this chapter).

9. Includes a few instructions that restore the contents of the esi, edi, and ebp registers. The first of these three instructions is labeled 1:

1: popl %ebp
popl %edi
popl %esi

Notice how these pop instructions refer to the kernel stack of the prev process. They will be executed when the scheduler selects prev as the new process to be executed on the CPU, thus invoking switch_to with prev as the second parameter. Therefore, the esp register points to prev's Kernel Mode stack.

10. Copies the content of the ebx register (corresponding to the last parameter of the switch_to macro) into the prev local variable:

movl %ebx,prev

As discussed earlier, the ebx register points to the descriptor of the process that has just been replaced.
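The role of the last parameter can be illustrated with a small user-space model. The sketch below is not kernel code: struct task, cpu_ebx, and model_switch_to( ) are hypothetical names, and the frozen schedule( ) frame of each process is reduced to two saved pointers. It assumes only what the text states, namely that the register survives the stack switch while the local variables of the resumed process are stale.

```c
#include <stddef.h>

/* Hypothetical stand-in for a process descriptor plus the part of its
 * Kernel Mode stack that holds the locals of schedule( ). */
struct task {
    const char  *name;
    struct task *frame_prev;   /* frozen copy of the prev local variable */
    struct task *frame_next;   /* frozen copy of the next local variable */
};

/* Models the CPU register (ebx) that the stack switch does not change. */
static struct task *cpu_ebx;

/* Models switch_to(prev, next, last): returns the value that the resumed
 * task finds in its own prev local variable after the switch. */
static struct task *model_switch_to(struct task *prev, struct task *next)
{
    /* Entry: the outgoing task leaves its locals on its own stack and
     * loads the register with prev. */
    prev->frame_prev = prev;
    prev->frame_next = next;
    cpu_ebx = prev;

    /* Stack switch: from here on, the visible locals are those frozen in
     * the frame of next, which may be stale. The register is untouched. */

    /* Exit: the register is written into last, i.e., into next's prev. */
    next->frame_prev = cpu_ebx;
    return next->frame_prev;
}
```

Switching A to B, then B to C, then C to A leaves A with prev pointing to C while its next still points to B, which is the situation described above.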

3.3.4 Saving the FPU, MMX, and XMM Registers

Starting with the Intel 80486, the arithmetic floating-point unit (FPU) has been integrated into the CPU. The name mathematical coprocessor continues to be used in memory of the days when floating-point computations were executed by an expensive special-purpose chip. To maintain compatibility with older models, however, floating-point arithmetic functions are performed with ESCAPE instructions, which are instructions with a prefix byte ranging between 0xd8 and 0xdf. These instructions act on the set of floating point registers included in the CPU. Clearly, if a process is using ESCAPE instructions, the contents of the floating point registers belong to its hardware context.

In later Pentium models, Intel introduced a new set of assembly language instructions into its microprocessors. They are called MMX instructions and are supposed to speed up the execution of multimedia applications. MMX instructions act on the floating point registers of the FPU. The obvious disadvantage of this architectural choice is that programmers cannot mix floating-point instructions and MMX instructions. The advantage is that operating system designers can ignore the new instruction set, since the same facility of the task-switching code for saving the state of the floating-point unit can also be relied upon to save the MMX state.

MMX instructions speed up multimedia applications because they introduce a single-instruction multiple-data (SIMD) pipeline inside the processor. The Pentium III model extends such SIMD capability: it introduces the SSE extensions (Streaming SIMD Extensions), which add facilities for handling floating point values contained in eight 128-bit registers (the XMM registers). Such registers do not overlap with the FPU and MMX registers, so SSE and FPU/MMX instructions may be freely mixed. The Pentium 4 model introduces yet another feature: the SSE2 extensions, which are basically an extension of SSE supporting higher-precision floating-point values. SSE2 uses the same set of XMM registers as SSE.

The 80x86 microprocessors do not automatically save the FPU, MMX, and XMM registers in the TSS. However, they include some hardware support that enables kernels to save these registers only when needed. The hardware support consists of a TS (Task-Switching) flag in the cr0 register, which obeys the following rules:

● Every time a hardware context switch is performed, the TS flag is set.

● Every time an ESCAPE, MMX, SSE, or SSE2 instruction is executed when the TS flag is set, the control unit raises a "Device not available" exception (see Chapter 4).

The TS flag allows the kernel to save and restore the FPU, MMX, and XMM registers only when really needed. To illustrate how it works, suppose that a process A is using the mathematical coprocessor. When a context switch occurs, the kernel sets the TS flag and saves the floating point registers into the TSS of process A. If the new process B does not use the mathematical coprocessor, the kernel won't need to restore the contents of the floating point registers. But as soon as B tries to execute an ESCAPE or MMX instruction, the CPU raises a "Device not available" exception, and the corresponding handler loads the floating point registers with the values saved in the TSS of process B.

Let's now describe the data structures introduced to handle selective loading of the FPU, MMX, and XMM registers. They are stored in the thread.i387 subfield of the process descriptor, whose format is described by the i387_union union:

union i387_union {
    struct i387_fsave_struct    fsave;
    struct i387_fxsave_struct   fxsave;
    struct i387_soft_struct     soft;
};

As you see, the field may store just one of three different types of data structures. The i387_soft_struct type is used by CPU models without a mathematical coprocessor; the Linux kernel still supports these old chips by emulating the coprocessor via software. We don't discuss this legacy case further, however. The i387_fsave_struct type is used by CPU models with a mathematical coprocessor and, optionally, an MMX unit. Finally, the i387_fxsave_struct type is used by CPU models featuring SSE and SSE2 extensions.

The process descriptor includes two additional flags:

● The PF_USEDFPU flag, which is included in the flags field. It specifies whether the process used the FPU, MMX, or XMM registers in the current execution run.

● The used_math field. This flag specifies whether the contents of the thread.i387 subfield are significant. The flag is cleared (not significant) in two cases, shown in the following list.

    ❍ When the process starts executing a new program by invoking an execve( ) system call (see Chapter 20). Since control will never return to the former program, the data currently stored in thread.i387 is never used again.

    ❍ When a process that was executing a program in User Mode starts executing a signal handler procedure (see Chapter 10). Since signal handlers are asynchronous with respect to the program execution flow, the floating point registers could be meaningless to the signal handler. However, the kernel saves the floating point registers in thread.i387 before starting the handler and restores them after the handler terminates. Therefore, a signal handler is allowed to use the mathematical coprocessor, but it cannot carry on a floating-point computation started during the normal program execution flow.

As stated earlier, the __switch_to( ) function executes the unlazy_fpu macro, passing the process descriptor of the process being replaced as an argument. The macro checks the value of the PF_USEDFPU flag of prev. If the flag is set, prev has used an FPU, MMX, SSE, or SSE2 instruction in this run of execution; therefore, the kernel must save the relative hardware context:

if (prev->flags & PF_USEDFPU)
    save_init_fpu(prev);

The save_init_fpu( ) function, in turn, executes the following operations:

1. Dumps the contents of the FPU registers in the process descriptor of prev and then re-initializes the FPU. If the CPU uses SSE/SSE2 extensions, it also dumps the contents of the XMM registers and re-initializes the SSE/SSE2 unit. A couple of powerful assembly language instructions take care of everything, either:

asm volatile( "fxsave %0 ; fnclex" : "=m" (tsk->thread.i387.fxsave) );

if the CPU uses SSE/SSE2 extensions, or otherwise:

asm volatile( "fnsave %0 ; fwait" : "=m" (tsk->thread.i387.fsave) );

2. Resets the PF_USEDFPU flag of prev:

prev->flags &= ~PF_USEDFPU;

3. Sets the TS flag of cr0 by means of the stts( ) macro, which in practice yields the following assembly language instructions:

movl %cr0, %eax
orl $8,%eax
movl %eax, %cr0
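The value 8 in the orl instruction selects bit 3 of cr0, which is the TS flag; the clts instruction clears the same bit. The fragment below models both operations on an ordinary integer so the bit arithmetic can be checked in user space. CR0_TS_MASK, model_stts( ), model_clts( ), and ts_is_set( ) are illustrative names only; real code must use the privileged instructions.

```c
#include <stdint.h>

#define CR0_TS_MASK 0x8u   /* bit 3 of cr0: the Task-Switching flag */

/* Equivalent of: movl %cr0,%eax ; orl $8,%eax ; movl %eax,%cr0 */
static uint32_t model_stts(uint32_t cr0)
{
    return cr0 | CR0_TS_MASK;
}

/* Equivalent of the clts instruction: clears only the TS flag. */
static uint32_t model_clts(uint32_t cr0)
{
    return cr0 & ~CR0_TS_MASK;
}

/* Nonzero if the next ESCAPE, MMX, or SSE/SSE2 instruction would trap. */
static int ts_is_set(uint32_t cr0)
{
    return (cr0 & CR0_TS_MASK) != 0;
}
```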

The contents of the floating point registers are not restored right after a process resumes execution. However, the TS flag of cr0 has been set by unlazy_fpu( ). Thus, the first time the process tries to execute an ESCAPE, MMX, or SSE/SSE2 instruction, the control unit raises a "Device not available" exception, and the kernel (more precisely, the exception handler invoked by the exception) runs the math_state_restore( ) function:

void math_state_restore( )
{
    asm("clts"); /* clear the TS flag of cr0 */
    if (current->used_math) {
        restore_fpu(current);
    } else {
        /* initialize the FPU unit */
        asm("fninit");
        /* and also the SSE/SSE2 unit, if present */
        if ( cpu_has_xmm )
            load_mxcsr(0x1f80);
        current->used_math = 1;
    }
    current->flags |= PF_USEDFPU;
}

Since the process is executing an FPU, MMX, or SSE/SSE2 instruction, this function sets the PF_USEDFPU flag. Moreover, the function clears the TS flag of cr0 so that further FPU, MMX, or SSE/SSE2 instructions executed by the process won't trigger the "Device not available" exception. If the data stored in the thread.i387 field is valid, the restore_fpu( ) function loads the registers with the proper values. To do this, either the fxrstor or the frstor assembly language instruction is used, depending on whether the CPU supports SSE/SSE2 extensions. Otherwise, if the data stored in the thread.i387 field is not valid, the FPU/MMX unit is re-initialized and all its registers are cleared. To re-initialize the SSE/SSE2 unit, it is sufficient to load a value in the MXCSR register, as the load_mxcsr( ) call in the code does.
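The interplay among the TS flag, PF_USEDFPU, and used_math is easier to follow in a user-space model. The sketch below is an illustration under simplifying assumptions, not the kernel's implementation: the whole FPU/MMX/XMM state is reduced to one double, the PF_USEDFPU value is arbitrary, and all model_ names are hypothetical. It mirrors the logic of unlazy_fpu( ) and math_state_restore( ) as described above.

```c
#include <stdbool.h>

#define PF_USEDFPU 0x1u    /* illustrative value, not the kernel's */

struct model_task {
    unsigned flags;        /* may hold PF_USEDFPU */
    bool     used_math;    /* true if saved_fpu is significant */
    double   saved_fpu;    /* stands in for thread.i387 */
};

struct model_cpu {
    bool   ts;             /* TS flag of cr0 */
    double fpu;            /* stands in for the FPU/MMX/XMM registers */
    int    saves;          /* number of register dumps performed */
    int    restores;       /* number of register reloads performed */
};

/* unlazy_fpu( ): dump the registers only if prev used them in this run. */
static void model_unlazy_fpu(struct model_cpu *cpu, struct model_task *prev)
{
    if (prev->flags & PF_USEDFPU) {
        prev->saved_fpu = cpu->fpu;   /* dump the registers */
        cpu->fpu = 0.0;               /* re-initialize the unit */
        cpu->saves++;
        prev->flags &= ~PF_USEDFPU;
        cpu->ts = true;               /* stts( ) */
    }
}

/* math_state_restore( ): body of the "Device not available" handler. */
static void model_math_state_restore(struct model_cpu *cpu,
                                     struct model_task *cur)
{
    cpu->ts = false;                  /* clts */
    if (cur->used_math) {
        cpu->fpu = cur->saved_fpu;
        cpu->restores++;
    } else {
        cpu->fpu = 0.0;               /* first use: start from a clean unit */
        cur->used_math = true;
    }
    cur->flags |= PF_USEDFPU;
}

/* The current task executes a floating-point instruction that adds x. */
static void model_fpu_add(struct model_cpu *cpu, struct model_task *cur,
                          double x)
{
    if (cpu->ts)                      /* the instruction traps first */
        model_math_state_restore(cpu, cur);
    cpu->fpu += x;
}
```

Starting with the TS flag set, a task that never touches the FPU causes no dump and no reload when it is switched out, which is the saving the lazy scheme is designed to obtain.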


3.4 Creating Processes

Unix operating systems rely heavily on process creation to satisfy user requests. For example, the shell creates a new process that executes another copy of the shell whenever the user enters a command.

Traditional Unix systems treat all processes in the same way: resources owned by the parent process are duplicated in the child process. This approach makes process creation very slow and inefficient, since it requires copying the entire address space of the parent process. The child process rarely needs to read or modify all the resources inherited from the parent; in many cases, it issues an immediate execve( ) and wipes out the address space that was so carefully copied.

Modern Unix kernels solve this problem by introducing three different mechanisms:

● The Copy On Write technique allows both the parent and the child to read the same physical pages. Whenever either one tries to write on a physical page, the kernel copies its contents into a new physical page that is assigned to the writing process. The implementation of this technique in Linux is fully explained in Chapter 8.

● Lightweight processes allow both the parent and the child to share many per-process kernel data structures, such as the paging tables (and therefore the entire User Mode address space), the open file tables, and the signal dispositions.

● The vfork( ) system call creates a process that shares the memory address space of its parent. To prevent the parent from overwriting data needed by the child, the parent's execution is blocked until the child exits or executes a new program. We'll learn more about the vfork( ) system call in the following section.
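The blocking behavior of vfork( ) can be observed from User Mode. The following is a minimal sketch for a Linux system with the standard C library; vfork_exit_code( ) is a hypothetical helper name. Because the child runs in the parent's address space, it does nothing except terminate with _exit( ), which is one of the few operations that are safe at that point.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Creates a child with vfork( ) and returns the child's exit code,
 * or -1 on failure. */
static int vfork_exit_code(int code)
{
    int status;
    pid_t pid = vfork();

    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(code);    /* child: only _exit( ) or an exec call is safe */

    /* Parent: it is resumed only after the child has exited. */
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

A call such as vfork_exit_code(7) is expected to return 7 once the child has terminated.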

3.4.1 The clone( ), fork( ), and vfork( ) System Calls

Lightweight processes are created in Linux by using a function named clone( ), which uses four parameters:

fn
    Specifies a function to be executed by the new process; when the function returns, the child terminates. The function returns an integer, which represents the exit code for the child process.

arg
    Points to data passed to the fn( ) function.

flags
    Miscellaneous information. The low byte specifies the signal number to be sent to the parent process when the child terminates; the SIGCHLD signal is generally selected. The remaining three bytes encode a group of clone flags, which specify the resources to be shared between the parent and the child process as follows:

    CLONE_VM
        Shares the memory descriptor and all Page Tables (see Chapter 8).

    CLONE_FS
        Shares the table that identifies the root directory and the current working directory, as well as the value of the bitmask used to mask the initial file permissions of a new file (the so-called file umask).

    CLONE_FILES
        Shares the table that identifies the open files (see Chapter 12).

    CLONE_PARENT
        Sets the parent of the child (p_pptr and p_opptr fields in the process descriptor) to the parent of the calling process.

    CLONE_PID
        Shares the PID. [6]

        [6] As we shall see later, the CLONE_PID flag can be used only by a process having a PID of 0; in a uniprocessor system, no two lightweight processes have the same PID.

    CLONE_PTRACE
        If a ptrace( ) system call is causing the parent process to be traced, the child will also be traced.

    CLONE_SIGHAND
        Shares the table that identifies the signal handlers (see Chapter 10).

    CLONE_THREAD
        Inserts the child into the same thread group of the parent, and the child's tgid field is set accordingly. If this flag is true, it implicitly enforces CLONE_PARENT.

    CLONE_SIGNAL
        Equivalent to setting both CLONE_SIGHAND and CLONE_THREAD, so that it is possible to send a signal to all threads of a multithreaded application.

    CLONE_VFORK
        Used for the vfork( ) system call (see later in this section).

child_stack
    Specifies the User Mode stack pointer to be assigned to the esp register of the child process. If it is equal to 0, the kernel assigns the current parent stack pointer to the child. Therefore, the parent and child temporarily share the same User Mode stack. But thanks to the Copy On Write mechanism, they usually get separate copies of the User Mode stack as soon as one tries to change the stack. However, this parameter must have a non-null value if the child process shares the same address space as the parent.

clone( ) is actually a wrapper function defined in the C library (see Section 9.1), which in turn uses a clone( ) system call hidden from the programmer. This system call receives only the flags and child_stack parameters; the new process always starts its execution from the instruction following the system call invocation. When the system call returns to the clone( ) function, it determines whether it is in the parent or the child and forces the child to execute the fn( ) function.
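A short User Mode example may help relate the four parameters to an actual call. The sketch below assumes a Linux system with the GNU C library, whose clone( ) wrapper takes the arguments in the order fn, child_stack, flags, arg. The names child_fn( ), clone_shared_vm( ), shared_value, and CHILD_STACK_SIZE are hypothetical. The child shares the parent's address space (CLONE_VM), so it needs its own stack, and SIGCHLD in the low byte of flags lets the parent wait for it in the usual way.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE    /* exposes clone( ) and the CLONE_* flags in <sched.h> */
#endif
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#define CHILD_STACK_SIZE (64 * 1024)

/* Visible to both processes because of CLONE_VM. */
static volatile int shared_value;

/* The fn parameter: runs in the child; its return value is the exit code. */
static int child_fn(void *arg)
{
    shared_value = *(int *)arg;
    return 0;
}

/* Returns the value the parent observes after the child has run,
 * or -1 on failure. */
static int clone_shared_vm(int value)
{
    char *stack = malloc(CHILD_STACK_SIZE);
    int status;
    pid_t pid;

    if (stack == NULL)
        return -1;
    shared_value = 0;

    /* The stack grows downward on the 80x86, so pass the top of the area. */
    pid = clone(child_fn, stack + CHILD_STACK_SIZE, CLONE_VM | SIGCHLD, &value);
    if (pid < 0 || waitpid(pid, &status, 0) != pid) {
        free(stack);
        return -1;
    }
    free(stack);
    return shared_value;
}
```

Since the address space is shared, the parent sees the value written by the child; without CLONE_VM the write would land in the child's private copy.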

The traditional fork( ) system call is implemented by Linux as a clone( ) system call whose flags parameter specifies both a SIGCHLD signal and all the clone flags cleared, and whose child_stack parameter is 0.
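Since fork( ) clears all the clone flags, nothing is shared: the child gets its own logical copy of the address space. Copy On Write itself is invisible to User Mode code, so the sketch below can only show its observable effect, which is that a write performed by the child does not reach the parent. It assumes a POSIX system; fork_private_copy( ) is a hypothetical helper name.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns the parent's view of a variable after the child has modified
 * its own copy, or -1 on failure. */
static int fork_private_copy(int initial)
{
    volatile int value = initial;
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        value = initial + 1;   /* the write affects only the child's copy */
        _exit(value == initial + 1 ? 0 : 1);
    }

    if (waitpid(pid, &status, 0) != pid ||
        !WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;
    return value;              /* still the initial value in the parent */
}
```

Calling fork_private_copy(10) is expected to return 10, in contrast with a child created with CLONE_VM, whose writes are visible to the parent.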

The vfork( ) system call, described in the previous section, is implemented by Linux as a clone( ) system call whose first parameter specifies both a SIGCHLD signal and the flags CLONE_VM and CLONE_VFORK, and whose second parameter is equal to 0.

When either a clone( ), fork( ), or vfork( ) system call is issued, the kernel invokes the do_fork( ) function, which executes the following steps:

1 . If t h e CLONE_PID fla g is s p e cifie d , t h e do_fork( ) fu n ct io n ch e cks wh e t h e r t h e PID o f t h e p a re n t p ro ce s s is n o t 0 ; if s o , it re t u rn s a n e rro r co d e . On ly t h e s w a p p e r p ro ce s s is a llo we d t o s e t CLONE_PID; t h is is re q u ire d wh e n in it ia lizin g a m u lt ip ro ce s s o r s ys t e m . 2 . Th e alloc_task_struct( ) fu n ct io n is in vo ke d t o g e t a n e w 8 KB union

task_union m e m o ry a re a t o s t o re t h e p ro ce s s d e s crip t o r a n d t h e Ke rn e l Mo d e s t a ck o f t h e n e w p ro ce s s . 3 . Th e fu n ct io n fo llo ws t h e current p o in t e r t o o b t a in t h e p a re n t p ro ce s s d e s crip t o r a n d co p ie s it in t o t h e n e w p ro ce s s d e s crip t o r in t h e m e m o ry a re a ju s t a llo ca t e d . 4 . A fe w ch e cks o ccu r t o m a ke s u re t h e u s e r h a s t h e re s o u rce s n e ce s s a ry t o s t a rt a n e w p ro ce s s . Firs t , t h e fu n ct io n ch e cks wh e t h e r current-

>rlim[RLIMIT_NPROC]. rlim_cur is s m a lle r t h a n o r e q u a l t o t h e cu rre n t n u m b e r o f p ro ce s s e s o wn e d b y t h e u s e r. If s o , a n e rro r co d e is re t u rn e d , u n le s s t h e p ro ce s s h a s ro o t p rivile g e s . Th e fu n ct io n g e t s t h e cu rre n t n u m b e r o f p ro ce s s e s o wn e d b y t h e u s e r fro m a p e r- u s e r d a t a s t ru ct u re n a m e d user_struct. Th is d a t a s t ru ct u re ca n b e fo u n d t h ro u g h a p o in t e r in t h e user fie ld o f t h e p ro ce s s d e s crip t o r.

5 . Th e fu n ct io n ch e cks t h a t t h e n u m b e r o f p ro ce s s e s is s m a lle r t h a n t h e va lu e o f t h e max_threads va ria b le . Th e in it ia l va lu e o f t h is va ria b le d e p e n d s o n t h e a m o u n t o f RAM in t h e s ys t e m . Th e g e n e ra l ru le is t h a t t h e s p a ce t a ke n b y a ll p ro ce s s

d e s crip t o rs a n d Ke rn e l Mo d e s t a cks ca n n o t e xce e d 1 / 8 o f t h e p h ys ica l m e m o ry. Ho we ve r, t h e s ys t e m a d m in is t ra t o r m a y ch a n g e t h is va lu e b y writ in g in t h e / p ro c/ s y s / k e rn e l/ t h re a d s - m a x file . 6 . If t h e p a re n t p ro ce s s u s e s a n y ke rn e l m o d u le s , t h e fu n ct io n in cre m e n t s t h e co rre s p o n d in g re fe re n ce co u n t e rs . As we s h a ll s e e in Ap p e n d ix B, e a ch ke rn e l m o d u le h a s it s o wn re fe re n ce co u n t e r, wh ich e n s u re s t h a t t h e m o d u le will n o t b e u n lo a d e d wh ile it is b e in g u s e d . 7 . Th e fu n ct io n t h e n u p d a t e s s o m e o f t h e fla g s in clu d e d in t h e flags fie ld t h a t h a ve b e e n co p ie d fro m t h e p a re n t p ro ce s s : a . It cle a rs t h e PF_SUPERPRIV fla g , wh ich in d ica t e s wh e t h e r t h e p ro ce s s h a s u s e d a n y o f it s s u p e ru s e r p rivile g e s . b . It cle a rs t h e PF_USEDFPU fla g .

c. It s e t s t h e PF_FORKNOEXEC fla g , wh ich in d ica t e s t h a t t h e ch ild p ro ce s s h a s n o t ye t is s u e d a n execve( ) s ys t e m ca ll.

8. Now the function has taken almost everything that it can use from the parent process; the rest of its activities focus on setting up new resources in the child and letting the kernel know that this new process has been born. First, the function invokes the get_pid( ) function to obtain a new PID, which will be assigned to the child process (unless the CLONE_PID flag is set).

9. The function then updates all the process descriptor fields that cannot be inherited from the parent process, such as the fields that specify the process parenthood relationships.

10. Unless specified differently by the flags parameter, it invokes copy_files( ), copy_fs( ), copy_sighand( ), and copy_mm( ) to create new data structures and copy into them the values of the corresponding parent process data structures.

11. The do_fork( ) function invokes copy_thread( ) to initialize the Kernel Mode stack of the child process with the values contained in the CPU registers when the clone( ) call was issued (these values have been saved in the Kernel Mode stack of the parent, as described in Chapter 9). However, the function forces the value 0 into the field corresponding to the eax register. The thread.esp field in the descriptor of the child process is initialized with the base address of the child's Kernel Mode stack, and the address of an assembly language function (ret_from_fork( )) is stored in the thread.eip field. The copy_thread( ) function also invokes unlazy_fpu( ) on the parent and duplicates the contents of the thread.i387 field.

12. If either CLONE_THREAD or CLONE_PARENT is set, the function copies the value of the p_opptr and p_pptr fields of the parent into the corresponding fields of the child. The parent of the child thus appears as the parent of the current process. Otherwise, the function stores the process descriptor address of current into the p_opptr and p_pptr fields of the child.

13. If the CLONE_PTRACE flag is not set, the function sets the ptrace field in the child process descriptor to 0. This field stores a few flags used when a process is being traced by another process. Even if the current process is being traced, the child will not be.

14. Conversely, if the CLONE_PTRACE flag is set, the function checks whether the parent process is being traced because, in this case, the child should be traced too. Therefore, if PT_PTRACED is set in current->ptrace, the function copies the current->p_pptr field into the corresponding field of the child.

15. The do_fork( ) function checks the value of CLONE_THREAD. If the flag is set, the function inserts the child in the thread group of the parent and copies in the tgid field the value of the parent's tgid; otherwise, the function sets the tgid field to the value of the pid field.

16. The function uses the SET_LINKS macro to insert the new process descriptor in the process list.

17. The function invokes hash_pid( ) to insert the new process descriptor in the pidhash hash table.

18. The function increments the values of nr_threads and current->user->processes.

19. If the child is being traced, the function sends a SIGSTOP signal to it so that the debugger has a chance to look at it before it starts the execution.

20. It invokes wake_up_process( ) to set the state field of the child process descriptor to TASK_RUNNING and to insert the child in the runqueue list.

21. If the CLONE_VFORK flag is specified, the function inserts the parent process in a wait queue and suspends it until the child releases its memory address space (that is, until the child either terminates or executes a new program).

22. The do_fork( ) function returns the PID of the child, which is eventually read by the parent process in User Mode.

Now we have a complete child process in the runnable state. But it isn't actually running. It is up to the scheduler to decide when to give the CPU to this child. At some future process switch, the scheduler bestows this favor on the child process by loading a few CPU registers with the values of the thread field of the child's process descriptor. In particular, esp is loaded with thread.esp (that is, with the address of the child's Kernel Mode stack), and eip is loaded with the address of ret_from_fork( ). This assembly language function, in turn, invokes the ret_from_sys_call( ) function (see Chapter 9), which reloads all other registers with the values stored in the stack and forces the CPU back to User Mode. The new process then starts its execution right at the end of the fork( ), vfork( ), or clone( ) system call. The value returned by the system call is contained in eax: the value is 0 for the child and equal to the PID for the child's parent.

The child process executes the same code as the parent, except that the fork returns a 0. The developer of the application can exploit this fact, in a manner familiar to Unix programmers, by inserting a conditional statement in the program based on the PID value that forces the child to behave differently from the parent process.
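The conditional statement on the value returned by fork( ) can be sketched in User Mode as follows. This is a minimal illustration, not kernel code: the fork_and_collect( ) name is hypothetical, and the child simply terminates with a code chosen by the caller so that the parent has something to collect.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Forks once. The child terminates with child_code; the parent waits
 * for it and returns the termination code it collected, or -1 on error. */
int fork_and_collect(int child_code)
{
    /* Termination codes seen by the parent are limited to 8 bits. */
    assert(child_code >= 0 && child_code <= 255);

    pid_t pid = fork();
    if (pid < 0)
        return -1;              /* fork failed: no child was created */

    if (pid == 0) {
        /* Child: the system call returned 0. */
        _exit(child_code);
    }

    /* Parent: the system call returned the PID of the child. */
    int status;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Both branches run the same program text; only the value found in the return register tells the two processes apart.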

3.4.2 Kernel Threads

Traditional Unix systems delegate some critical tasks to intermittently running processes, including flushing disk caches, swapping out unused page frames, servicing network connections, and so on. Indeed, it is not efficient to perform these tasks in strict linear fashion; both their functions and the end user processes get better responses if they are scheduled in the background. Since some of the system processes run only in Kernel Mode, modern operating systems delegate their functions to kernel threads, which are not encumbered with the unnecessary User Mode context. In Linux, kernel threads differ from regular processes in the following ways:

● Each kernel thread executes a single specific kernel C function, while regular processes execute kernel functions only through system calls.

● Kernel threads run only in Kernel Mode, while regular processes run alternatively in Kernel Mode and in User Mode.

● Since kernel threads run only in Kernel Mode, they use only linear addresses greater than PAGE_OFFSET. Regular processes, on the other hand, use all four gigabytes of linear addresses, in either User Mode or Kernel Mode.
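The address-range distinction in the last item can be made concrete with a small sketch. It assumes the usual 80x86 configuration in which PAGE_OFFSET is 0xc0000000 (3 GB), and it treats PAGE_OFFSET itself as the first kernel address; the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: default 80x86 split, kernel range starts at 3 GB. */
#define PAGE_OFFSET 0xC0000000u

/* True if a 32-bit linear address falls in the range used by the kernel. */
bool is_kernel_linear_address(uint32_t addr)
{
    return addr >= PAGE_OFFSET;
}

/* Size in bytes of the kernel part of the 4 GB linear address space. */
uint64_t kernel_range_size(void)
{
    return (UINT64_C(1) << 32) - PAGE_OFFSET;
}

/* Small self-check of the boundary between the two ranges. */
void check_address_split(void)
{
    assert(!is_kernel_linear_address(PAGE_OFFSET - 1));
    assert(is_kernel_linear_address(PAGE_OFFSET));
}
```

Under this assumption a kernel thread confines itself to the topmost gigabyte, while a regular process also uses the three gigabytes below it.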

3.4.2.1 Creating a kernel thread

The kernel_thread( ) function creates a new kernel thread and can be executed only by another kernel thread. The function contains mostly inline assembly language code, but it is roughly equivalent to the following:

    int kernel_thread(int (*fn)(void *), void * arg,
                      unsigned long flags)
    {
        int p;
        p = clone( 0, flags | CLONE_VM );
        if ( p )        /* parent */
            return p;
        else {          /* child */
            fn(arg);
            exit( );
        }
    }

3.4.2.2 Process 0

The ancestor of all processes, called process 0 or, for historical reasons, the swapper process, is a kernel thread created from scratch during the initialization phase of Linux by the start_kernel( ) function (see Appendix A). This ancestor process uses the following data structures:

● A process descriptor and a Kernel Mode stack stored in the init_task_union variable. The init_task and init_stack macros yield the addresses of the process descriptor and the stack, respectively.

● The following tables, which the process descriptor points to:

  ❍ init_mm
  ❍ init_fs
  ❍ init_files
  ❍ init_signals

  The tables are initialized, respectively, by the following macros:

  ❍ INIT_MM
  ❍ INIT_FS
  ❍ INIT_FILES
  ❍ INIT_SIGNALS

● The master kernel Page Global Directory stored in swapper_pg_dir (see Section 2.5.5).

The start_kernel( ) function initializes all the data structures needed by the kernel, enables interrupts, and creates another kernel thread, named process 1 (more commonly referred to as the init process):

    kernel_thread(init, NULL, CLONE_FS | CLONE_FILES | CLONE_SIGNAL);

The newly created kernel thread has PID 1 and shares all per-process kernel data structures with process 0. Moreover, when selected by the scheduler, the init process starts executing the init( ) function.
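The effect of sharing data structures through clone flags can be observed from User Mode with the C library's clone( ) wrapper. The sketch below is only an analogy to kernel_thread( ), not its implementation: the wrapper takes a function and a stack, unlike the kernel-internal call shown earlier, and the run_sharing_child( ) and child_body( ) names, the stack size, and the omission of the signal-sharing flags are all assumptions made for the example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Lives in the address space shared by parent and child (CLONE_VM). */
static volatile int shared_value;

/* Function executed by the child, much as fn(arg) in kernel_thread( ). */
static int child_body(void *arg)
{
    shared_value = *(int *)arg;
    return 0;                   /* becomes the child's termination code */
}

/* Creates a child sharing memory, filesystem information, and open
 * files with the caller; returns what the child stored, or -1 on error. */
int run_sharing_child(int value)
{
    const size_t stack_size = 64 * 1024;    /* hypothetical size */
    char *stack = malloc(stack_size);
    if (stack == NULL)
        return -1;

    shared_value = 0;
    /* The stack grows downward on 80x86, so pass its top address. */
    pid_t pid = clone(child_body, stack + stack_size,
                      CLONE_VM | CLONE_FS | CLONE_FILES | SIGCHLD,
                      &value);
    if (pid == -1) {
        free(stack);
        return -1;
    }

    int status;
    pid_t reaped = waitpid(pid, &status, 0);
    free(stack);
    if (reaped != pid || !WIFEXITED(status))
        return -1;
    return shared_value;        /* written by the child, read by the parent */
}
```

Because CLONE_VM is set, the assignment performed by the child is visible to the parent; without that flag the child would modify only its own copy.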

After having created the init process, process 0 executes the cpu_idle( ) function, which essentially consists of repeatedly executing the hlt assembly language instruction with the interrupts enabled (see Chapter 4). Process 0 is selected by the scheduler only when there are no other processes in the TASK_RUNNING state.

3.4.2.3 Process 1

The kernel thread created by process 0 executes the init( ) function, which in turn completes the initialization of the kernel. Then init( ) invokes the execve( ) system call to load the executable program init. As a result, the init kernel thread becomes a regular process having its own per-process kernel data structure (see Chapter 20). The init process stays alive until the system is shut down, since it creates and monitors the activity of all processes that implement the outer layers of the operating system.

3.4.2.4 Other kernel threads

Linux uses many other kernel threads. Some of them are created in the initialization phase and run until shutdown; others are created "on demand," when the kernel must execute a task that is better performed in its own execution context. The most important kernel threads (besides process 0 and process 1) are:

keventd
  Executes the tasks in the tq_context task queue (see Section 4.7.3).

kapm
  Handles the events related to the Advanced Power Management (APM).

kswapd
  Performs memory reclaiming, as described in Section 16.7.7.

kflushd (also bdflush)
  Flushes "dirty" buffers to disk to reclaim memory, as described in Section 14.2.4.

kupdated
  Flushes old "dirty" buffers to disk to reduce risks of filesystem inconsistencies, as described in Section 14.2.4.

ksoftirqd
  Runs the tasklets (see Section 4.7); there is one kernel thread for each CPU in the system.


3.5 Destroying Processes

Most processes "die" in the sense that they terminate the execution of the code they were supposed to run. When this occurs, the kernel must be notified so that it can release the resources owned by the process; this includes memory, open files, and any other odds and ends that we will encounter in this book, such as semaphores.

The usual way for a process to terminate is to invoke the exit( ) library function, which releases the resources allocated by the C library, executes each function registered by the programmer, and ends up invoking the _exit( ) system call. The exit( ) function may be inserted by the programmer explicitly. Additionally, the C compiler always inserts an exit( ) function call right after the last statement of the main( ) function.

Alternatively, the kernel may force a process to die. This typically occurs when the process has received a signal that it cannot handle or ignore (see Chapter 10) or when an unrecoverable CPU exception has been raised in Kernel Mode while the kernel was running on behalf of the process (see Chapter 4).
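The two ways of dying can be told apart by the parent process. The following User Mode sketch, whose function names are hypothetical, first lets a child terminate through exit( ) after registering a function with atexit( ), and then forces another child to die with SIGKILL, a signal that cannot be handled or ignored. A pipe is used only to let the parent verify that the registered function ran.

```c
#include <assert.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int report_fd = -1;

/* Registered with atexit( ): executed by exit( ) before _exit( ). */
static void on_exit_handler(void)
{
    const char mark = 'X';
    if (write(report_fd, &mark, 1) != 1) {
        /* nothing sensible to do while terminating */
    }
}

/* Returns 1 if the child's registered function ran and the parent
 * collected the expected termination code, 0 if not, -1 on error. */
int exit_runs_handlers(int code)
{
    int fds[2];
    assert(code >= 0 && code <= 255);
    if (pipe(fds) == -1)
        return -1;
    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {                 /* child: voluntary termination */
        close(fds[0]);
        report_fd = fds[1];
        atexit(on_exit_handler);
        exit(code);
    }
    close(fds[1]);
    char mark = 0;
    ssize_t n = read(fds[0], &mark, 1);
    close(fds[0]);
    int status;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return n == 1 && mark == 'X' &&
           WIFEXITED(status) && WEXITSTATUS(status) == code;
}

/* Returns the number of the signal that killed the child, or -1. */
int killed_by_signal(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {                 /* child: just waits to be killed */
        for (;;)
            pause();
    }
    kill(pid, SIGKILL);             /* forced termination */
    int status;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
}
```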

3.5.1 Process Termination

All process terminations are handled by the do_exit( ) function, which removes most references to the terminating process from kernel data structures. The do_exit( ) function executes the following actions:

1. Sets the PF_EXITING flag in the flags field of the process descriptor to indicate that the process is being eliminated.

2. Removes, if necessary, the process descriptor from an IPC semaphore queue via the sem_exit( ) function (see Chapter 19) or from a dynamic timer queue via the del_timer_sync( ) function (see Chapter 6).

3. Examines the process's data structures related to paging, open file descriptors, filesystem, and signal handling, respectively, with the __exit_mm( ), __exit_files( ), __exit_fs( ), and exit_sighand( ) functions. These functions also remove each of these data structures if no other processes are sharing them.

4. Decrements the resource counters of the modules used by the process.

5. Sets the exit_code field of the process descriptor to the process termination code. This value is either the _exit( ) system call parameter (normal termination), or an error code supplied by the kernel (abnormal termination).

6. Invokes the exit_notify( ) function to update the parenthood relationships of both the parent process and the child processes. All child processes created by the terminating process become children of another process in the same thread group, if any, or of the init process. Moreover, exit_notify( ) sets the state field of the process descriptor to TASK_ZOMBIE. We shall see what happens to zombie processes in the following section.

7. Invokes the schedule( ) function (see Chapter 11) to select a new process to run. Since a process in a TASK_ZOMBIE state is ignored by the scheduler, the process stops executing right after the switch_to macro in schedule( ) is invoked.

3.5.2 Process Removal

The Unix operating system allows a process to query the kernel to obtain the PID of its parent process or the execution state of any of its children. A process may, for instance, create a child process to perform a specific task and then invoke a wait( )-like system call to check whether the child has terminated. If the child has terminated, its termination code will tell the parent process if the task has been carried out successfully.

To comply with these design choices, Unix kernels are not allowed to discard data included in a process descriptor field right after the process terminates. They are allowed to do so only after the parent process has issued a wait( )-like system call that refers to the terminated process. This is why the TASK_ZOMBIE state has been introduced: although the process is technically dead, its descriptor must be saved until the parent process is notified.

What happens if parent processes terminate before their children? In such a case, the system could be flooded with zombie processes that might end up using all the available task entries. As mentioned earlier, this problem is solved by forcing all orphan processes to become children of the init process. In this way, the init process will destroy the zombies while checking for the termination of one of its legitimate children through a wait( )-like system call.

The release_task( ) function releases the process descriptor of a zombie process by executing the following steps:

1. Decrements by 1 the number of processes created up to now by the user owner of the terminated process. This value is stored in the user_struct structure mentioned earlier in the chapter.

2. Invokes the free_uid( ) function to decrement by 1 the resource counter of the user_struct structure.

3. Invokes unhash_process( ), which in turn:

  a. Decrements by 1 the nr_threads variable

  b. Invokes unhash_pid( ) to remove the process descriptor from the pidhash hash table

  c. Uses the REMOVE_LINKS macro to unlink the process descriptor from the process list

  d. Removes the process from its thread group, if any

4. Invokes the free_task_struct( ) function to release the 8-KB memory area used to contain the process descriptor and the Kernel Mode stack.


Chapter 4. Interrupts and Exceptions

An interrupt is usually defined as an event that alters the sequence of instructions executed by a processor. Such events correspond to electrical signals generated by hardware circuits both inside and outside the CPU chip. Interrupts are often divided into synchronous and asynchronous interrupts:

● Synchronous interrupts are produced by the CPU control unit while executing instructions and are called synchronous because the control unit issues them only after terminating the execution of an instruction.

● Asynchronous interrupts are generated by other hardware devices at arbitrary times with respect to the CPU clock signals.

Intel microprocessor manuals designate synchronous and asynchronous interrupts as exceptions and interrupts, respectively. We'll adopt this classification, although we'll occasionally use the term "interrupt signal" to designate both types together (synchronous as well as asynchronous).

Interrupts are issued by interval timers and I/O devices; for instance, the arrival of a keystroke from a user sets off an interrupt.

Exceptions, on the other hand, are caused either by programming errors or by anomalous conditions that must be handled by the kernel. In the first case, the kernel handles the exception by delivering to the current process one of the signals familiar to every Unix programmer. In the second case, the kernel performs all the steps needed to recover from the anomalous condition, such as a Page Fault or a request (via an int instruction) for a kernel service.

We start by describing in the next section the motivation for introducing such signals. We then show how the well-known IRQs (Interrupt ReQuests) issued by I/O devices give rise to interrupts, and we detail how 80x86 processors handle interrupts and exceptions at the hardware level. Then we illustrate, in Section 4.4, how Linux initializes all the data structures required by the Intel interrupt architecture. The remaining three sections describe how Linux handles interrupt signals at the software level.

One word of caution before moving on: in this chapter, we cover only "classic" interrupts common to all PCs; we do not cover the nonstandard interrupts of some architectures. For instance, laptops generate types of interrupts not discussed here.


4.1 The Role of Interrupt Signals

As the name suggests, interrupt signals provide a way to divert the processor to code outside the normal flow of control. When an interrupt signal arrives, the CPU must stop what it's currently doing and switch to a new activity; it does this by saving the current value of the program counter (i.e., the content of the eip and cs registers) in the Kernel Mode stack and by placing an address related to the interrupt type into the program counter.

There are some things in this chapter that will remind you of the context switch described in the previous chapter, carried out when a kernel substitutes one process for another. But there is a key difference between interrupt handling and process switching: the code executed by an interrupt or an exception handler is not a process. Rather, it is a kernel control path that runs on behalf of the same process that was running when the interrupt occurred (see Section 4.3). As a kernel control path, the interrupt handler is lighter than a process (it has less context and requires less time to set up or tear down).

Interrupt handling is one of the most sensitive tasks performed by the kernel, since it must satisfy the following constraints:

● Interrupts can come at any time, when the kernel may want to finish something else it was trying to do. The kernel's goal is therefore to get the interrupt out of the way as soon as possible and defer as much processing as it can. For instance, suppose a block of data has arrived on a network line. When the hardware interrupts the kernel, it could simply mark the presence of data, give the processor back to whatever was running before, and do the rest of the processing later (such as moving the data into a buffer where its recipient process can find it and then restarting the process). The activities that the kernel needs to perform in response to an interrupt are thus divided into two parts: a top half that the kernel executes right away and a bottom half that is left for later. The kernel keeps a queue pointing to all the functions that represent bottom halves waiting to be executed and pulls them off the queue to execute them at particular points in processing.

● Since interrupts can come at any time, the kernel might be handling one of them while another one (of a different type) occurs. This should be allowed as much as possible since it keeps the I/O devices busy (see Section 4.3). As a result, the interrupt handlers must be coded so that the corresponding kernel control paths can be executed in a nested manner. When the last kernel control path terminates, the kernel must be able to resume execution of the interrupted process or switch to another process if the interrupt signal has caused a rescheduling activity.

● Although the kernel may accept a new interrupt signal while handling a previous one, some critical regions exist inside the kernel code where interrupts must be disabled. Such critical regions must be limited as much as possible since, according to the previous requirement, the kernel, and particularly the interrupt handlers, should run most of the time with the interrupts enabled.
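The split between a top half and a queue of bottom halves can be illustrated with a small User Mode simulation. It is only a sketch of the idea described in the first constraint, not the mechanism Linux actually uses: every name, the fixed queue length, and the simulated network device are assumptions, and a real kernel would also have to protect the queue against nested interrupts.

```c
#include <assert.h>
#include <stddef.h>

#define BH_QUEUE_LEN 8              /* hypothetical capacity */

typedef void (*bottom_half_fn)(void *data);

struct bh_entry {
    bottom_half_fn fn;
    void *data;
};

static struct bh_entry bh_queue[BH_QUEUE_LEN];
static size_t bh_head, bh_count;

/* Records deferred work; returns 0 on success, -1 if the queue is full. */
int bh_enqueue(bottom_half_fn fn, void *data)
{
    assert(fn != NULL);
    if (bh_count == BH_QUEUE_LEN)
        return -1;
    size_t tail = (bh_head + bh_count) % BH_QUEUE_LEN;
    bh_queue[tail].fn = fn;
    bh_queue[tail].data = data;
    bh_count++;
    return 0;
}

/* Drains the queue in FIFO order; returns how many functions ran. */
size_t bh_run_pending(void)
{
    size_t ran = 0;
    while (bh_count > 0) {
        struct bh_entry e = bh_queue[bh_head];
        bh_head = (bh_head + 1) % BH_QUEUE_LEN;
        bh_count--;
        e.fn(e.data);
        ran++;
    }
    return ran;
}

/* Simulated network device used by the example. */
struct net_device_sim {
    int data_pending;               /* blocks marked by the top half */
    int packets_delivered;          /* blocks moved by the bottom half */
};

/* Bottom half: the slow part, moving data to its recipient. */
static void net_bottom_half(void *data)
{
    struct net_device_sim *dev = data;
    dev->packets_delivered += dev->data_pending;
    dev->data_pending = 0;
}

/* Top half: marks the presence of data and defers everything else. */
int net_top_half(struct net_device_sim *dev)
{
    dev->data_pending++;
    return bh_enqueue(net_bottom_half, dev);
}
```

Two simulated interrupts leave two entries in the queue and deliver nothing until bh_run_pending( ) is called, which mirrors the deferral described above.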


4.2 Interrupts and Exceptions

The Intel documentation classifies interrupts and exceptions as follows:

● Interrupts:

Maskable interrupts
  All Interrupt Requests (IRQs) issued by I/O devices give rise to maskable interrupts. A maskable interrupt can be in two states: masked or unmasked; a masked interrupt is ignored by the control unit as long as it remains masked.

Nonmaskable interrupts
  Only a few critical events (such as hardware failures) give rise to nonmaskable interrupts. Nonmaskable interrupts are always recognized by the CPU.

● Exceptions:

Processor-detected exceptions
  Generated when the CPU detects an anomalous condition while executing an instruction. These are further divided into three groups, depending on the value of the eip register that is saved on the Kernel Mode stack when the CPU control unit raises the exception.

  Faults
    Can generally be corrected; once corrected, the program is allowed to restart with no loss of continuity. The saved value of eip is the address of the instruction that caused the fault, and hence that instruction can be resumed when the exception handler terminates. As we shall see in Section 8.4, resuming the same instruction is necessary whenever the handler is able to correct the anomalous condition that caused the exception.

  Traps
    Reported immediately following the execution of the trapping instruction; after the kernel returns control to the program, it is allowed to continue its execution with no loss of continuity. The saved value of eip is the address of the instruction that should be executed after the one that caused the trap. A trap is triggered only when there is no need to reexecute the instruction that terminated. The main use of traps is for debugging purposes. The role of the interrupt signal in this case is to notify the debugger that a specific instruction has been executed (for instance, a breakpoint has been reached within a program). Once the user has examined the data provided by the debugger, she may ask that execution of the debugged program resume, starting from the next instruction.

  Aborts
    A serious error occurred; the control unit is in trouble, and it may be unable to store in the eip register the precise location of the instruction causing the exception. Aborts are used to report severe errors, such as hardware failures and invalid or inconsistent values in system tables. The interrupt signal sent by the control unit is an emergency signal used to switch control to the corresponding abort exception handler. This handler has no choice but to force the affected process to terminate.

Programmed exceptions

Occur at the request of the programmer. They are triggered by int or int3 instructions; the into (check for overflow) and bound (check on address bound) instructions also give rise to a programmed exception when the condition they are checking is not true. Programmed exceptions are handled by the control unit as traps; they are often called software interrupts. Such exceptions have two common uses: to implement system calls and to notify a debugger of a specific event (see Chapter 9).

Each interrupt or exception is identified by a number ranging from 0 to 255; Intel calls this 8-bit unsigned number a vector. The vectors of nonmaskable interrupts and exceptions are fixed, while those of maskable interrupts can be altered by programming the Interrupt Controller (see the next section).
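The practical difference between faults, traps, and aborts lies in the eip value saved by the control unit. The following C fragment is a minimal sketch of that rule, not kernel code: the exc_class type and the saved_eip() helper are hypothetical names introduced only for illustration, and the trap case assumes that execution continues sequentially after the trapping instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The three classes of processor-detected exceptions described above. */
enum exc_class { EXC_FAULT, EXC_TRAP, EXC_ABORT };

/*
 * Hypothetical helper (not a kernel function): computes the eip value the
 * control unit saves for an instruction located at insn_addr and insn_len
 * bytes long. Returns 1 and stores the value in *saved for faults and
 * traps; returns 0 for aborts, whose saved eip may be meaningless.
 */
int saved_eip(enum exc_class cls, uint32_t insn_addr, uint32_t insn_len,
              uint32_t *saved)
{
    assert(saved != NULL);

    switch (cls) {
    case EXC_FAULT:
        *saved = insn_addr;             /* the faulting instruction is resumed */
        return 1;
    case EXC_TRAP:
        *saved = insn_addr + insn_len;  /* execution continues after the trap */
        return 1;
    case EXC_ABORT:
    default:
        return 0;                       /* no reliable location is available */
    }
}
```

For a one-byte int3 instruction at address 0x1000, the sketch yields a saved eip of 0x1001, whereas a fault raised by an instruction at the same address yields 0x1000.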

4.2.1 IRQs and Interrupts

Each hardware device controller capable of issuing interrupt requests has an output line designated as an Interrupt ReQuest (IRQ). All existing IRQ lines are connected to the input pins of a hardware circuit called the Interrupt Controller, which performs the following actions:

1. Monitors the IRQ lines, checking for raised signals.

2. If a raised signal occurs on an IRQ line:

   a. Converts the raised signal received into a corresponding vector.

   b. Stores the vector in an Interrupt Controller I/O port, thus allowing the CPU to read it via the data bus.

   c. Sends a raised signal to the processor INTR pin; that is, issues an interrupt.

   d. Waits until the CPU acknowledges the interrupt signal by writing into one of the Programmable Interrupt Controller (PIC) I/O ports; when this occurs, clears the INTR line.

3. Goes back to Step 1.

The IRQ lines are sequentially numbered starting from 0; therefore, the first IRQ line is usually denoted as IRQ0. Intel's default vector associated with IRQn is n+32. As mentioned before, the mapping between IRQs and vectors can be modified by issuing suitable I/O instructions to the Interrupt Controller ports.

Each IRQ line can be selectively disabled: the PIC can be programmed to stop issuing interrupts that refer to a given IRQ line, or to resume issuing them. Disabled interrupts are not lost; the PIC sends them to the CPU as soon as they are enabled again. This feature is used by most interrupt handlers since it allows them to process IRQs of the same type serially.

Selective enabling/disabling of IRQs is not the same as global masking/unmasking of maskable interrupts. When the IF flag of the eflags register is clear, each maskable interrupt issued by the PIC is temporarily ignored by the CPU. The cli and sti assembly language instructions, respectively, clear and set that flag. Masking and unmasking interrupts on a multiprocessor system is trickier since each CPU has its own eflags register. We'll deal with this topic in Chapter 5.

Traditional PICs are implemented by connecting "in cascade" two 8259A-style external chips. Each chip can handle up to eight different IRQ input lines. Since the INT output line of the slave PIC is connected to the IRQ2 pin of the master PIC, the number of available IRQ lines is limited to 15.
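The default IRQ-to-vector mapping and the rule that disabled interrupts are not lost can be illustrated with a small C model. This is only a sketch under simplifying assumptions: the toy_pic structure and the pic_* and irq_to_vector() names are hypothetical, the controller is modeled as two bitmasks rather than as real 8259A registers, and the lowest-numbered pending IRQ is always served first.

```c
#include <assert.h>
#include <stdint.h>

#define NR_IRQS      16   /* two cascaded chips with eight inputs each */
#define FIRST_VECTOR 32   /* Intel's default: IRQn maps to vector n + 32 */

/* Toy model of the controller state; not a real 8259A register layout. */
struct toy_pic {
    uint16_t disabled;    /* bit n set: IRQn is selectively disabled */
    uint16_t pending;     /* bit n set: IRQn was raised and not yet issued */
};

int irq_to_vector(unsigned irq)
{
    assert(irq < NR_IRQS);
    return FIRST_VECTOR + (int)irq;
}

/* A device raises IRQn: the request is latched even if the line is disabled. */
void pic_raise(struct toy_pic *pic, unsigned irq)
{
    assert(irq < NR_IRQS);
    pic->pending |= (uint16_t)(1u << irq);
}

void pic_disable(struct toy_pic *pic, unsigned irq)
{
    assert(irq < NR_IRQS);
    pic->disabled |= (uint16_t)(1u << irq);
}

void pic_enable(struct toy_pic *pic, unsigned irq)
{
    assert(irq < NR_IRQS);
    pic->disabled &= (uint16_t)~(1u << irq);
}

/*
 * Returns the vector of the lowest-numbered IRQ that is both pending and
 * enabled, clearing its pending bit, or -1 if there is nothing to issue.
 */
int pic_next_vector(struct toy_pic *pic)
{
    uint16_t ready = pic->pending & (uint16_t)~pic->disabled;

    for (unsigned irq = 0; irq < NR_IRQS; irq++) {
        if (ready & (1u << irq)) {
            pic->pending &= (uint16_t)~(1u << irq);
            return irq_to_vector(irq);
        }
    }
    return -1;
}
```

In this model, raising IRQ4 while it is disabled produces no vector; once IRQ4 is enabled again, pic_next_vector() returns 36, the default vector for that line.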

4.2.1.1 The Advanced Programmable Interrupt Controller (APIC)

The previous description refers to PICs designed for uniprocessor systems. If the system includes a single CPU, the output line of the master PIC can be connected in a straightforward way to the INTR pin of the CPU. However, if the system includes two or more CPUs, this approach is no longer valid and more sophisticated PICs are needed.

Being able to deliver interrupts to each CPU in the system is crucial for fully exploiting the parallelism of the SMP architecture. For that reason, Intel has introduced a new component designated as the I/O Advanced Programmable Interrupt Controller (I/O APIC), which replaces the old 8259A Programmable Interrupt Controller. Moreover, all current Intel CPUs include a local APIC. Each local APIC has 32-bit registers, an internal clock, a local timer device, and two additional IRQ lines, LINT0 and LINT1, reserved for local interrupts. All local APICs are connected to an external I/O APIC, giving rise to a multi-APIC system.

Figure 4-1 illustrates in a schematic way the structure of a multi-APIC system. An APIC bus connects the "frontend" I/O APIC to the local APICs. The IRQ lines coming from the devices are connected to the I/O APIC, which therefore acts as a router with respect to the local APICs. In the motherboards of the Pentium III and earlier processors, the APIC bus was a serial three-line bus; starting with the Pentium 4, the APIC bus is implemented by means of the system bus. However, since the APIC bus and its messages are invisible to software, we won't give further details.

Figure 4-1. Multi-APIC system

The I/O APIC consists of a set of 24 IRQ lines, a 24-entry Interrupt Redirection Table, programmable registers, and a message unit for sending and receiving APIC messages over the APIC bus. Unlike IRQ pins of the 8259A, interrupt priority is not related to pin number: each entry in the Redirection Table can be individually programmed to indicate the interrupt vector and priority, the destination processor, and how the processor is selected. The information in the Redirection Table is used to translate each external IRQ signal into a message to one or more local APIC units via the APIC bus.

Interrupt requests coming from external hardware devices can be distributed among the available CPUs in two ways:

Static distribution

The IRQ signal is delivered to the local APICs listed in the corresponding Redirection Table entry. The interrupt is delivered to one specific CPU, to a subset of CPUs, or to all CPUs at once (broadcast mode).

Dynamic distribution

The IRQ signal is delivered to the local APIC of the processor that is executing the process with the lowest priority. Any local APIC has a programmable task priority register (TPR), which is used to compute the priority of the currently running process. Intel expects the operating system kernel to modify this register at each process switch.

If two or more CPUs share the lowest priority, the load is distributed between them using a technique called arbitration. Each CPU is assigned an arbitration priority ranging from 0 to 15 in the arbitration priority register of the local APIC. Every local APIC has a unique value. Every time an interrupt is delivered to a CPU, its corresponding arbitration priority is automatically set to 0, while the arbitration priority of every other CPU is incremented. When the arbitration priority register becomes greater than 15, it is set to the previous arbitration priority of the winning CPU incremented by 1. Therefore, interrupts are distributed in a round-robin fashion among CPUs with the same task priority. [1]

[1] The Pentium 4 local APIC doesn't have an arbitration priority register; the arbitration mechanism is hidden in the bus arbitration circuitry. The Intel manuals state that if the operating system kernel does not regularly update the task priority registers, performance may be suboptimal because interrupts might always be serviced by the same CPU.

Besides distributing interrupts among processors, the multi-APIC system allows CPUs to generate interprocessor interrupts. When a CPU wishes to send an interrupt to another CPU, it stores the interrupt vector and the identifier of the target's local APIC in the Interrupt Command Register (ICR) of its own local APIC. A message is then sent via the APIC bus to the target's local APIC, which therefore issues a corresponding interrupt to its own CPU.

Interprocessor interrupts (in short, IPIs) are part of the SMP architecture and are actively used by Linux to exchange messages among CPUs (see Section 4.6.1.7 later in this chapter).

Most of the current uniprocessor systems include an I/O APIC chip, which may be configured in two distinct ways:

● As a standard 8259A-style external PIC connected to the CPU. The local APIC is disabled and the two LINT0 and LINT1 local IRQ lines are configured, respectively, as the INTR and NMI pins.

● As a standard external I/O APIC. The local APIC is enabled and all external interrupts are received through the I/O APIC.
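Returning to the interprocessor interrupts described earlier in this section, the two pieces of information a CPU stores in the ICR can be sketched in C as follows. This is a simplified illustration, not the Linux code: the icr_value structure and the make_ipi() helper are hypothetical names, and every ICR field other than the vector and the destination is left at zero, whereas real code must also select a delivery mode and a destination mode.

```c
#include <assert.h>
#include <stdint.h>

/* The 64-bit ICR is accessed as two 32-bit registers of the local APIC. */
struct icr_value {
    uint32_t high;   /* destination local APIC identifier in bits 24-31 */
    uint32_t low;    /* interrupt vector in bits 0-7; writing it sends the IPI */
};

/*
 * Hypothetical helper: builds the values to be written in the ICR in order
 * to send the given vector to the local APIC identified by dest_apic_id.
 */
struct icr_value make_ipi(uint8_t vector, uint8_t dest_apic_id)
{
    struct icr_value icr;

    /* The local APIC rejects vectors 0-15 as illegal. */
    assert(vector >= 16);

    icr.high = (uint32_t)dest_apic_id << 24;
    icr.low = (uint32_t)vector;
    return icr;
}
```

A sender would write the high word first and the low word last, because the write to the low word is what triggers the message on the APIC bus.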

4.2.2 Exceptions

The 80x86 microprocessors issue roughly 20 different exceptions. [2] The kernel must provide a dedicated exception handler for each exception type. For some exceptions, the CPU control unit also generates a hardware error code and pushes it in the Kernel Mode stack before starting the exception handler.

[2] The exact number depends on the processor model.

The following list gives the vector, the name, the type, and a brief description of the exceptions found in 80x86 processors. Additional information may be found in the Intel technical documentation.

0 - "Divide error" (fault)

Raised when a program issues an integer division by 0.

1 - "Debug" (trap or fault)

Raised when the TF flag of eflags is set (quite useful to implement step-by-step execution of a debugged program) or when the address of an instruction or operand falls within the range of an active debug register (see Section 3.3.1).

2 - Not used

Reserved for nonmaskable interrupts (those that use the NMI pin).

3 - "Breakpoint" (trap)

Caused by an int3 (breakpoint) instruction (usually inserted by a debugger).

4 - "Overflow" (trap)

An into (check for overflow) instruction has been executed when the OF (overflow) flag of eflags is set.

5 - "Bounds check" (fault)

A bound (check on address bound) instruction is executed with the operand outside of the valid address bounds.

6 - "Invalid opcode" (fault)

The CPU execution unit has detected an invalid opcode (the part of the machine instruction that determines the operation performed).

7 - "Device not available" (fault)

An ESCAPE, MMX, or XMM instruction has been executed with the TS flag of cr0 set (see Section 3.3.4).

8 - "Double fault" (abort)

Normally, when the CPU detects an exception while trying to call the handler for a prior exception, the two exceptions can be handled serially. In a few cases, however, the processor cannot handle them serially, so it raises this exception.

9 - "Coprocessor segment overrun" (abort)

Problems with the external mathematical coprocessor (applies only to old 80386 microprocessors).

10 - "Invalid TSS" (fault)

The CPU has attempted a context switch to a process having an invalid Task State Segment.

11 - "Segment not present" (fault)

A reference was made to a segment not present in memory (one in which the Segment-Present flag of the Segment Descriptor was cleared).

12 - "Stack segment" (fault)

The instruction attempted to exceed the stack segment limit, or the segment identified by ss is not present in memory.

13 - "General protection" (fault)

One of the protection rules in the protected mode of the 80x86 has been violated.

14 - "Page Fault" (fault)

The addressed page is not present in memory, the corresponding Page Table entry is null, or a violation of the paging protection mechanism has occurred.

15 - Reserved by Intel

16 - "Floating-point error" (fault)

The floating-point unit integrated into the CPU chip has signaled an error condition, such as numeric overflow or division by 0. [3]

[3] The 80x86 microprocessors also generate this exception when performing a signed division whose result cannot be stored as a signed integer (for instance, a division between -2147483648 and -1).

17 - "Alignment check" (fault)

The address of an operand is not correctly aligned (for instance, the address of a long integer is not a multiple of 4).

18 - "Machine check" (abort)

A machine-check mechanism has detected a CPU or bus error.

19 - "SIMD floating point" (fault)

The SSE or SSE2 unit integrated in the CPU chip has signaled an error condition on a floating-point operation.

The values from 20 to 31 are reserved by Intel for future development. As illustrated in Table 4-1, each exception is handled by a specific exception handler (see Section 4.5 later in this chapter), which usually sends a Unix signal to the process that caused the exception.

Table 4-1. Signals sent by the exception handlers

#    Exception                      Exception handler                 Signal
0    Divide error                   divide_error( )                   SIGFPE
1    Debug                          debug( )                          SIGTRAP
2    NMI                            nmi( )                            None
3    Breakpoint                     int3( )                           SIGTRAP
4    Overflow                       overflow( )                       SIGSEGV
5    Bounds check                   bounds( )                         SIGSEGV
6    Invalid opcode                 invalid_op( )                     SIGILL
7    Device not available           device_not_available( )           SIGSEGV
8    Double fault                   double_fault( )                   SIGSEGV
9    Coprocessor segment overrun    coprocessor_segment_overrun( )    SIGFPE
10   Invalid TSS                    invalid_tss( )                    SIGSEGV
11   Segment not present            segment_not_present( )            SIGBUS
12   Stack exception                stack_segment( )                  SIGBUS
13   General protection             general_protection( )             SIGSEGV
14   Page Fault                     page_fault( )                     SIGSEGV
15   Intel reserved                 None                              None
16   Floating-point error           coprocessor_error( )              SIGFPE
17   Alignment check                alignment_check( )                SIGBUS
18   Machine check                  machine_check( )                  None
19   SIMD floating point            simd_coprocessor_error( )         SIGFPE
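The correspondence in Table 4-1 can also be written as a simple lookup table. The fragment below is a user-space sketch that merely restates the table; it is not how the kernel dispatches signals, and the names exception_signal_table, exception_to_signal(), and NO_SIGNAL are hypothetical.

```c
#include <assert.h>
#include <signal.h>

#define NR_LISTED_EXCEPTIONS 20
#define NO_SIGNAL 0   /* hypothetical marker: real signal numbers are positive */

/* Signal sent for each exception vector, as listed in Table 4-1. */
static const int exception_signal_table[NR_LISTED_EXCEPTIONS] = {
    [0]  = SIGFPE,     /* Divide error */
    [1]  = SIGTRAP,    /* Debug */
    [2]  = NO_SIGNAL,  /* NMI */
    [3]  = SIGTRAP,    /* Breakpoint */
    [4]  = SIGSEGV,    /* Overflow */
    [5]  = SIGSEGV,    /* Bounds check */
    [6]  = SIGILL,     /* Invalid opcode */
    [7]  = SIGSEGV,    /* Device not available */
    [8]  = SIGSEGV,    /* Double fault */
    [9]  = SIGFPE,     /* Coprocessor segment overrun */
    [10] = SIGSEGV,    /* Invalid TSS */
    [11] = SIGBUS,     /* Segment not present */
    [12] = SIGBUS,     /* Stack exception */
    [13] = SIGSEGV,    /* General protection */
    [14] = SIGSEGV,    /* Page Fault */
    [15] = NO_SIGNAL,  /* Intel reserved */
    [16] = SIGFPE,     /* Floating-point error */
    [17] = SIGBUS,     /* Alignment check */
    [18] = NO_SIGNAL,  /* Machine check */
    [19] = SIGFPE,     /* SIMD floating point */
};

/* Returns the signal associated with a vector, or NO_SIGNAL if none is sent. */
int exception_to_signal(unsigned vector)
{
    assert(vector < NR_LISTED_EXCEPTIONS);
    return exception_signal_table[vector];
}
```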

4.2.3 Interrupt Descriptor Table

A system table called Interrupt Descriptor Table (IDT) associates each interrupt or exception vector with the address of the corresponding interrupt or exception handler. The IDT must be properly initialized before the kernel enables interrupts.

The IDT format is similar to that of the GDT and the LDTs examined in Chapter 2. Each entry corresponds to an interrupt or an exception vector and consists of an 8-byte descriptor. Thus, a maximum of 256 x 8 = 2048 bytes are required to store the IDT.

The idtr CPU register allows the IDT to be located anywhere in memory: it specifies both the IDT base linear address and its limit (maximum length). It must be initialized before enabling interrupts by using the lidt assembly language instruction.

The IDT may include three types of descriptors; Figure 4-2 illustrates the meaning of the 64 bits included in each of them. In particular, the value of the Type field encoded in the bits 40-43 identifies the descriptor type.

Figure 4-2. Gate descriptors' format

The descriptors are:

Task gate

Includes the TSS selector of the process that must replace the current one when an interrupt signal occurs. Linux does not use task gates.

Interrupt gate

Includes the Segment Selector and the offset inside the segment of an interrupt or exception handler. While transferring control to the proper segment, the processor clears the IF flag, thus disabling further maskable interrupts.

Trap gate

Similar to an interrupt gate, except that while transferring control to the proper segment, the processor does not modify the IF flag.

As we shall see in Section 4.4.1 later in this chapter, Linux uses interrupt gates to handle interrupts and trap gates to handle exceptions.
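The 64-bit layout of a gate descriptor can be made concrete with a short C sketch. It follows the 32-bit 80x86 gate format (offset in bits 0-15 and 48-63, selector in bits 16-31, Type in bits 40-43, DPL in bits 45-46, Present flag in bit 47); the function names make_gate(), gate_type_of(), and gate_dpl_of() are hypothetical and do not correspond to the Linux routines.

```c
#include <assert.h>
#include <stdint.h>

/* Values of the 4-bit Type field for 32-bit gates. */
enum gate_type {
    GATE_TASK      = 0x5,
    GATE_INTERRUPT = 0xE,
    GATE_TRAP      = 0xF
};

/*
 * Hypothetical helper: packs a gate descriptor into 64 bits. For a task
 * gate, the selector is a TSS selector and the offset is not used.
 */
uint64_t make_gate(enum gate_type type, unsigned dpl, uint16_t selector,
                   uint32_t offset)
{
    uint64_t d = 0;

    assert(dpl <= 3);

    d |= (uint64_t)(offset & 0xFFFFu);     /* bits 0-15: offset, low half */
    d |= (uint64_t)selector << 16;         /* bits 16-31: Segment Selector */
    d |= (uint64_t)type << 40;             /* bits 40-43: Type */
    d |= (uint64_t)dpl << 45;              /* bits 45-46: DPL */
    d |= (uint64_t)1 << 47;                /* bit 47: Present flag */
    d |= (uint64_t)(offset >> 16) << 48;   /* bits 48-63: offset, high half */
    return d;
}

unsigned gate_type_of(uint64_t d)
{
    return (unsigned)((d >> 40) & 0xFu);
}

unsigned gate_dpl_of(uint64_t d)
{
    return (unsigned)((d >> 45) & 0x3u);
}
```

With the arbitrary illustration values selector 0x10 and offset 0xC0101234, an interrupt gate with DPL 0 is encoded as 0xC0108E0000101234. Since each descriptor takes 8 bytes, a full table of 256 entries occupies the 2048 bytes mentioned above.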

4.2.4 Hardware Handling of Interrupts and Exceptions

We now describe how the CPU control unit handles interrupts and exceptions. We assume that the kernel has been initialized and thus the CPU is operating in Protected Mode.

After executing an instruction, the cs and eip pair of registers contain the logical address of the next instruction to be executed. Before dealing with that instruction, the control unit checks whether an interrupt or an exception occurred while the control unit executed the previous instruction. If one occurred, the control unit does the following:

1. Determines the vector i (0 ≤ i ≤ 255) associated with the interrupt or the exception.

2. Reads the i-th entry of the IDT referred to by the idtr register (we assume in the following description that the entry contains an interrupt or a trap gate).

3. Gets the base address of the GDT from the gdtr register and looks in the GDT to read the Segment Descriptor identified by the selector in the IDT entry. This descriptor specifies the base address of the segment that includes the interrupt or exception handler.

4. Makes sure the interrupt was issued by an authorized source. First, it compares the Current Privilege Level (CPL), which is stored in the two least significant bits of the cs register, with the Descriptor Privilege Level (DPL) of the Segment Descriptor included in the GDT. Raises a "General protection" exception if the CPL is lower than the DPL because the interrupt handler cannot have a lower privilege than the program that caused the interrupt. For programmed exceptions, it makes a further security check. It compares the CPL with the DPL of the gate descriptor included in the IDT and raises a "General protection" exception if the DPL is lower than the CPL. This last check makes it possible to prevent access by user applications to specific trap or interrupt gates.

5. Checks whether a change of privilege level is taking place; that is, whether the CPL is different from the selected Segment Descriptor's DPL. If so, the control unit must start using the stack that is associated with the new privilege level. It does this by performing the following steps:

   a. Reads the tr register to access the TSS segment of the running process.

   b. Loads the ss and esp registers with the proper values for the stack segment and stack pointer associated with the new privilege level. These values are found in the TSS (see Section 3.3.2).

   c. In the new stack, saves the previous values of ss and esp, which define the logical address of the stack associated with the old privilege level.

6. If a fault has occurred, loads cs and eip with the logical address of the instruction that caused the exception so that it can be executed again.

7. Saves the contents of eflags, cs, and eip in the stack.

8. If the exception carries a hardware error code, saves it on the stack.

9. Loads cs and eip, respectively, with the Segment Selector and the Offset fields of the Gate Descriptor stored in the i-th entry of the IDT. These values define the logical address of the first instruction of the interrupt or exception handler.

The last step performed by the control unit is equivalent to a jump to the interrupt or exception handler. In other words, the instruction processed by the control unit after dealing with the interrupt signal is the first instruction of the selected handler.

After the interrupt or exception is processed, the corresponding handler must relinquish control to the interrupted process by issuing the iret instruction, which forces the control unit to:

1. Load the cs, eip, and eflags registers with the values saved on the stack. If a hardware error code has been pushed in the stack on top of the eip contents, it must be popped before executing iret.

2. Check whether the CPL of the handler is equal to the value contained in the two least significant bits of cs (this means the interrupted process was running at the same privilege level as the handler). If so, iret concludes execution; otherwise, go to the next step.

3. Load the ss and esp registers from the stack and return to the stack associated with the old privilege level.

4. Examine the contents of the ds, es, fs, and gs segment registers; if any of them contains a selector that refers to a Segment Descriptor whose DPL value is lower than CPL, clear the corresponding segment register. The control unit does this to forbid User Mode programs that run with a CPL equal to 3 from using segment registers previously used by kernel routines (with a DPL equal to 0). If these registers were not cleared, malicious User Mode programs could exploit them in order to access the kernel address space.
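The privilege checks of steps 4 and 5 of the interrupt entry sequence above can be summarized in a few lines of C. This is a minimal sketch of the decision logic only, assuming that privilege levels are plain integers from 0 (most privileged) to 3; the entry_result type and the check_entry() function are hypothetical names, not part of any processor or kernel interface.

```c
#include <assert.h>
#include <stdbool.h>

enum entry_result {
    ENTRY_SAME_STACK,          /* handler runs on the current stack */
    ENTRY_SWITCH_STACK,        /* privilege change: ss and esp come from the TSS */
    ENTRY_GENERAL_PROTECTION   /* a "General protection" exception is raised */
};

/*
 * cpl:         Current Privilege Level of the interrupted program
 * handler_dpl: DPL of the Segment Descriptor that includes the handler
 * gate_dpl:    DPL of the gate descriptor found in the IDT
 * programmed:  true for programmed exceptions (int, int3, into, bound)
 */
enum entry_result check_entry(unsigned cpl, unsigned handler_dpl,
                              unsigned gate_dpl, bool programmed)
{
    assert(cpl <= 3 && handler_dpl <= 3 && gate_dpl <= 3);

    /* Step 4, first check: the handler cannot be less privileged. */
    if (cpl < handler_dpl)
        return ENTRY_GENERAL_PROTECTION;

    /* Step 4, second check: applied to programmed exceptions only. */
    if (programmed && gate_dpl < cpl)
        return ENTRY_GENERAL_PROTECTION;

    /* Step 5: a change of privilege level requires a stack switch. */
    return (cpl != handler_dpl) ? ENTRY_SWITCH_STACK : ENTRY_SAME_STACK;
}
```

Under these assumptions, a hardware interrupt arriving in User Mode (CPL 3) for a handler at DPL 0 leads to a stack switch, while an int instruction issued in User Mode through a gate whose DPL is 0 leads to a "General protection" exception.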


4.3 Nested Execution of Exception and Interrupt Handlers Wh e n h a n d lin g a n in t e rru p t o r a n e xce p t io n , t h e ke rn e l b e g in s a n e w k e rn e l co n t ro l p a t h , o r s e p a ra t e s e q u e n ce o f in s t ru ct io n s . Wh e n a p ro ce s s is s u e s a s ys t e m ca ll re q u e s t , fo r in s t a n ce , t h e firs t in s t ru ct io n s o f t h e co rre s p o n d in g ke rn e l co n t ro l p a t h a re t h o s e t h a t s a ve t h e co n t e n t o f t h e re g is t e rs in t h e Ke rn e l Mo d e s t a ck, wh ile t h e la s t in s t ru ct io n s a re t h o s e t h a t re s t o re t h e co n t e n t o f t h e re g is t e rs a n d p u t t h e CPU b a ck in t o Us e r Mo d e . Lin u x d e s ig n d o e s n o t a llo w p ro ce s s s wit ch in g wh ile t h e CPU is e xe cu t in g a ke rn e l co n t ro l p a t h a s s o cia t e d wit h a n in t e rru p t . Ho we ve r, s u ch ke rn e l co n t ro l p a t h s m a y b e a rb it ra rily n e s t e d ; a n in t e rru p t h a n d le r m a y b e in t e rru p t e d b y a n o t h e r in t e rru p t h a n d le r, t h u s g ivin g ra is e t o a n e s t e d e xe cu t io n o f ke rn e l t h re a d s . We e m p h a s ize t h a t t h e cu rre n t p ro ce s s d o e s n 't ch a n g e wh ile t h e ke rn e l is h a n d lin g a n e s t e d s e t o f ke rn e l co n t ro l p a t h s . As s u m in g t h a t t h e ke rn e l is b u g fre e , m o s t e xce p t io n s ca n o ccu r o n ly wh ile t h e CPU is in Us e r Mo d e . In d e e d , t h e y a re e it h e r ca u s e d b y p ro g ra m m in g e rro rs o r t rig g e re d b y d e b u g g e rs . Ho we ve r, t h e Pa g e Fa u lt e xce p t io n m a y o ccu r in Ke rn e l Mo d e . Th is h a p p e n s wh e n t h e p ro ce s s a t t e m p t s t o a d d re s s a p a g e t h a t b e lo n g s t o it s a d d re s s s p a ce b u t is n o t cu rre n t ly in RAM. 
Wh ile h a n d lin g s u ch a n e xce p t io n , t h e ke rn e l m a y s u s p e n d t h e cu rre n t p ro ce s s a n d re p la ce it wit h a n o t h e r o n e u n t il t h e re q u e s t e d p a g e is a va ila b le . Th e ke rn e l co n t ro l p a t h t h a t h a n d le s t h e Pa g e fa u lt e xce p t io n re s u m e s e xe cu t io n a s s o o n a s t h e p ro ce s s g e t s t h e p ro ce s s o r a g a in . S in ce t h e Pa g e Fa u lt e xce p t io n h a n d le r n e ve r g ive s ris e t o fu rt h e r e xce p t io n s , a t m o s t t wo ke rn e l co n t ro l p a t h s a s s o cia t e d wit h e xce p t io n s ( t h e firs t o n e ca u s e d b y a s ys t e m ca ll in vo ca t io n , t h e s e co n d o n e ca u s e d b y a Pa g e Fa u lt ) m a y b e s t a cke d , o n e o n t o p o f t h e o t h e r. In co n t ra s t t o e xce p t io n s , in t e rru p t s is s u e d b y I/ O d e vice s d o n o t re fe r t o d a t a s t ru ct u re s s p e cific t o t h e cu rre n t p ro ce s s , a lt h o u g h t h e ke rn e l co n t ro l p a t h s t h a t h a n d le t h e m ru n o n b e h a lf o f t h a t p ro ce s s . As a m a t t e r o f fa ct , it is im p o s s ib le t o p re d ict wh ich p ro ce s s will b e ru n n in g wh e n a g ive n in t e rru p t o ccu rs . An in t e rru p t h a n d le r m a y p re e m p t b o t h o t h e r in t e rru p t h a n d le rs a n d e xce p t io n h a n d le rs . Co n ve rs e ly, a n e xce p t io n h a n d le r n e ve r p re e m p t s a n in t e rru p t h a n d le r. Th e o n ly e xce p t io n t h a t ca n b e t rig g e re d in Ke rn e l Mo d e is Pa g e Fa u lt , wh ich we ju s t d e s crib e d . Bu t in t e rru p t h a n d le rs n e ve r p e rfo rm o p e ra t io n s t h a t ca n in d u ce Pa g e Fa u lt s , a n d t h u s , p o t e n t ia lly, p ro ce s s s wit ch . Lin u x in t e rle a ve s ke rn e l co n t ro l p a t h s fo r t wo m a jo r re a s o n s : ●



To im p ro ve t h e t h ro u g h p u t o f p ro g ra m m a b le in t e rru p t co n t ro lle rs a n d d e vice co n t ro lle rs . As s u m e t h a t a d e vice co n t ro lle r is s u e s a s ig n a l o n a n IRQ lin e : t h e PIC t ra n s fo rm s it in t o a n e xt e rn a l in t e rru p t , a n d t h e n b o t h t h e PIC a n d t h e d e vice co n t ro lle r re m a in b lo cke d u n t il t h e PIC re ce ive s a n a ckn o wle d g m e n t fro m t h e CPU. Th a n ks t o ke rn e l co n t ro l p a t h in t e rle a vin g , t h e ke rn e l is a b le t o s e n d t h e a ckn o wle d g m e n t e ve n wh e n it is h a n d lin g a p re vio u s in t e rru p t . To im p le m e n t a n in t e rru p t m o d e l wit h o u t p rio rit y le ve ls . S in ce e a ch in t e rru p t h a n d le r m a y b e d e fe rre d b y a n o t h e r o n e , t h e re is n o n e e d t o e s t a b lis h p re d e fin e d p rio rit ie s a m o n g h a rd wa re d e vice s . Th is s im p lifie s t h e ke rn e l co d e a n d im p ro ve s it s p o rt a b ilit y.

On multiprocessor systems, several kernel control paths may execute concurrently. Moreover, a kernel control path associated with an exception may start executing on a CPU and, due to a process switch, migrate to another CPU.


4.4 Initializing the Interrupt Descriptor Table

Now that you understand what the Intel processor does with interrupts and exceptions at the hardware level, we can move on to describe how the Interrupt Descriptor Table is initialized.

Remember that before the kernel enables the interrupts, it must load the initial address of the IDT into the idtr register and initialize all the entries of that table. This activity is done while initializing the system (see Appendix A).

The int instruction allows a User Mode process to issue an interrupt signal that has an arbitrary vector ranging from 0 to 255. Therefore, initialization of the IDT must be done carefully, to block illegal interrupts and exceptions simulated by User Mode processes via int instructions. This can be achieved by setting the DPL field of the Interrupt or Trap Gate Descriptor to 0. If the process attempts to issue one of these interrupt signals, the control unit checks the CPL value against the DPL field and issues a "General protection" exception.

In a few cases, however, a User Mode process must be able to issue a programmed exception. To allow this, it is sufficient to set the DPL field of the corresponding Interrupt or Trap Gate Descriptors to 3, that is, as high as possible.

Let's now see how Linux implements this strategy.

4.4.1 Interrupt, Trap, and System Gates

As mentioned in the earlier section Section 4.2.3, Intel provides three types of interrupt descriptors: Task, Interrupt, and Trap Gate Descriptors. Task Gate Descriptors are irrelevant to Linux, but its Interrupt Descriptor Table contains several Interrupt and Trap Gate Descriptors. Linux classifies them as follows, using a slightly different breakdown and terminology from Intel:

Interrupt gate
An Intel interrupt gate that cannot be accessed by a User Mode process (the gate's DPL field is equal to 0). All Linux interrupt handlers are activated by means of interrupt gates, and all are restricted to Kernel Mode.

System gate
An Intel trap gate that can be accessed by a User Mode process (the gate's DPL field is equal to 3). The four Linux exception handlers associated with the vectors 3, 4, 5, and 128 are activated by means of system gates, so the four assembly language instructions int3, into, bound, and int $0x80 can be issued in User Mode.

Trap gate
An Intel trap gate that cannot be accessed by a User Mode process (the gate's DPL field is equal to 0). Most Linux exception handlers are activated by means of trap gates.

The following architecture-dependent functions are used to insert gates in the IDT:

set_intr_gate(n,addr)
Inserts an interrupt gate in the nth IDT entry. The Segment Selector inside the gate is set to the kernel code's Segment Selector. The Offset field is set to addr, which is the address of the interrupt handler. The DPL field is set to 0.

set_system_gate(n,addr)
Inserts a trap gate in the nth IDT entry. The Segment Selector inside the gate is set to the kernel code's Segment Selector. The Offset field is set to addr, which is the address of the exception handler. The DPL field is set to 3.

set_trap_gate(n,addr)
Similar to the previous function, except the DPL field is set to 0.
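The three functions differ only in the gate type and the DPL they store. The following user-space sketch packs those fields into the 8-byte layout of an 80x86 gate descriptor to make the difference concrete. It is an illustration under stated assumptions, not the kernel's code: the `gate_desc` structure, the `idt_model` array, the `KERNEL_CS_MODEL` selector value, and every `model_` function are hypothetical names introduced here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical user-space model of an IDT; none of these names are kernel symbols. */
#define KERNEL_CS_MODEL 0x10u  /* assumed selector of the kernel code segment */
#define GATE_INTERRUPT  14u    /* 80x86 type code of a 32-bit interrupt gate */
#define GATE_TRAP       15u    /* 80x86 type code of a 32-bit trap gate */

struct gate_desc {
    uint32_t low;   /* Segment Selector (bits 31-16) | Offset bits 15-0 */
    uint32_t high;  /* Offset bits 31-16 | Present | DPL | type | reserved */
};

struct gate_desc idt_model[256];

/* Fills entry n with a gate of the given type and DPL pointing to addr. */
void model_set_gate(unsigned n, unsigned type, unsigned dpl, uint32_t addr)
{
    assert(n < 256 && dpl <= 3);
    idt_model[n].low  = (KERNEL_CS_MODEL << 16) | (addr & 0xffffu);
    idt_model[n].high = (addr & 0xffff0000u)
                      | 0x8000u       /* Present bit */
                      | (dpl << 13)   /* Descriptor Privilege Level */
                      | (type << 8);  /* gate type */
}

/* Interrupt gate, DPL 0: not reachable from User Mode. */
void model_set_intr_gate(unsigned n, uint32_t addr)
{
    model_set_gate(n, GATE_INTERRUPT, 0, addr);
}

/* Trap gate, DPL 3: reachable from User Mode. */
void model_set_system_gate(unsigned n, uint32_t addr)
{
    model_set_gate(n, GATE_TRAP, 3, addr);
}

/* Trap gate, DPL 0: not reachable from User Mode. */
void model_set_trap_gate(unsigned n, uint32_t addr)
{
    model_set_gate(n, GATE_TRAP, 0, addr);
}

/* Extracts the DPL stored in entry n. */
unsigned model_gate_dpl(unsigned n)
{
    return (idt_model[n].high >> 13) & 3u;
}

/* An int instruction issued at privilege level cpl passes the check only if
 * cpl is numerically not greater than the gate's DPL. */
int model_int_allowed(unsigned n, unsigned cpl)
{
    return cpl <= model_gate_dpl(n);
}
```

With this model, a gate installed by `model_set_system_gate(128, addr)` accepts an `int` issued at CPL 3, while one installed by `model_set_intr_gate()` or `model_set_trap_gate()` rejects it, which corresponds to the "General protection" exception described earlier.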

4.4.2 Preliminary Initialization of the IDT

The IDT is initialized and used by the BIOS routines when the computer still operates in Real Mode. Once Linux takes over, however, the IDT is moved to another area of RAM and initialized a second time, since Linux does not use any BIOS routines (see Appendix A).

The idt variable points to the IDT, while the IDT itself is stored in the idt_table table, which includes 256 entries.[4] The 6-byte idt_descr variable stores both the size of the IDT and its address and is used only when the kernel initializes the idtr register with the lidt assembly language instruction.

[4] Some Pentium models have the notorious "f00f" bug, which allows a User Mode program to freeze the system. When executing on such CPUs, Linux uses a workaround based on storing the IDT in a write-protected page frame. The workaround for the bug is offered as an option when the user compiles the kernel.

During kernel initialization, the setup_idt( ) assembly language function starts by filling all 256 entries of idt_table with the same interrupt gate, which refers to the ignore_int( ) interrupt handler:

    setup_idt:
        lea ignore_int, %edx
        movl $(__KERNEL_CS

        enable(irq);
    }
    spin_lock_irqrestore(&(irq_desc[irq].lock), flags);

The function detects that an interrupt was lost by checking the value of the IRQ_PENDING flag. The flag is always cleared when leaving the interrupt handler; therefore, if the IRQ line is disabled and the flag is set, then an interrupt occurrence has been acknowledged but not yet serviced. In this case it is necessary to issue a new interrupt. This is achieved by forcing the local APIC to generate a self-interrupt (see the later section Section 4.6.2). The role of the IRQ_REPLAY flag is to ensure that exactly one self-interrupt is generated. Remember that the do_IRQ( ) function clears that flag when it starts handling the interrupt.
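The interplay of the two flags can be followed on a small user-space model. The sketch below is only an illustration of the rule stated in the paragraph above: the `ST_` flag values, the `irq_line_sketch` structure, its `resent` counter (standing in for the self-interrupt), and all `sketch_` functions are assumptions made for this example and are not kernel definitions.

```c
#include <assert.h>

/* Hypothetical status bits; the values are illustrative only. */
#define ST_DISABLED 0x1u  /* models IRQ_DISABLED */
#define ST_PENDING  0x2u  /* models IRQ_PENDING  */
#define ST_REPLAY   0x4u  /* models IRQ_REPLAY   */

struct irq_line_sketch {
    unsigned status;
    int resent;  /* how many self-interrupts the model has requested */
};

/* Re-enables the line; if an interrupt was acknowledged but never serviced
 * and no replay is already in flight, requests exactly one self-interrupt. */
void sketch_enable_irq(struct irq_line_sketch *line)
{
    line->status &= ~ST_DISABLED;
    if ((line->status & (ST_PENDING | ST_REPLAY)) == ST_PENDING) {
        line->status |= ST_REPLAY;
        line->resent++;
    }
}

/* Models the start of interrupt handling, where the replay flag is cleared. */
void sketch_begin_handling(struct irq_line_sketch *line)
{
    line->status &= ~ST_REPLAY;
}

/* Models leaving the interrupt handler, where the pending flag is cleared. */
void sketch_end_handling(struct irq_line_sketch *line)
{
    line->status &= ~ST_PENDING;
}
```

Calling `sketch_enable_irq()` twice on a line whose pending flag is still set raises `resent` only once, because the second call finds the replay flag already set.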

4.6.1.7 Interrupt service routines

As mentioned previously, an interrupt service routine implements a device-specific operation. When an interrupt handler must execute the ISRs, it invokes the handle_IRQ_event( ) function. This function essentially performs the steps shown in the following list.

1. Invokes the irq_enter( ) function to increment the __local_irq_count field of the irq_stat entry of the executing CPU (to learn how many interrupt handlers are stacked in the CPU, see the earlier section Section 4.6.1.2). As we shall see in Chapter 5, this function also checks that interrupts are not globally disabled.

2. Enables the local interrupts with the sti assembly language instruction if the SA_INTERRUPT flag is clear.

3. Executes each interrupt service routine of the interrupt through the following code:

    do {
        action->handler(irq, action->dev_id, regs);
        action = action->next;
    } while (action);

   At the start of the loop, action points to the start of a list of irqaction data structures that indicate the actions to be taken upon receiving the interrupt (see Figure 4-4 earlier in this chapter).

4. Disables the local interrupts with the cli assembly language instruction.

5. Invokes irq_exit( ) to decrement the __local_irq_count field of the irq_stat entry of the executing CPU.

All interrupt service routines act on the same parameters:

irq
The IRQ number

dev_id
The device identifier

regs
A pointer to the Kernel Mode stack area containing the registers saved right after the interrupt occurred

The first parameter allows a single ISR to handle several IRQ lines, the second one allows a single ISR to take care of several devices of the same type, and the last one allows the ISR to access the execution context of the interrupted kernel control path. In practice, most ISRs do not use these parameters.

The SA_INTERRUPT flag of the main IRQ descriptor determines whether interrupts must be enabled or disabled when the do_IRQ( ) function invokes an ISR. An ISR that has been invoked with the interrupts in one state is allowed to put them in the opposite state. In a uniprocessor system, this can be achieved by means of the cli (disable interrupts) and sti (enable interrupts) assembly language instructions. Globally enabling or disabling interrupts in a multiprocessor system is a much more complicated task; we'll deal with it in Chapter 5.

The structure of an ISR depends on the characteristics of the device handled. We'll give a few examples of ISRs in Chapter 6, Chapter 13, and Chapter 18.
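To see how the list walk and the dev_id parameter cooperate, the loop of step 3 can be exercised in user space. The sketch below is a simplified model, not kernel code: the `regs_model`, `isr_action_model`, and `dev_model` structures and the `model_` functions are hypothetical, and the enabling and disabling of interrupts around the loop is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the saved registers; only here so the signature matches. */
struct regs_model { unsigned long ip; };

/* Reduced, hypothetical version of an irqaction descriptor. */
struct isr_action_model {
    void (*handler)(int irq, void *dev_id, struct regs_model *regs);
    void *dev_id;
    struct isr_action_model *next;
};

/* Runs every ISR registered on the line, in list order.
 * Returns the number of routines invoked. */
int model_handle_irq_event(int irq, struct regs_model *regs,
                           struct isr_action_model *action)
{
    int invoked = 0;

    assert(action != NULL);  /* the do-while loop needs at least one entry */
    do {
        action->handler(irq, action->dev_id, regs);
        invoked++;
        action = action->next;
    } while (action);
    return invoked;
}

/* Per-device state selected through dev_id. */
struct dev_model { int interrupts_seen; };

/* One ISR serving several devices of the same type. */
void model_isr(int irq, void *dev_id, struct regs_model *regs)
{
    struct dev_model *dev = dev_id;

    (void)irq;   /* unused, like in most real ISRs */
    (void)regs;
    dev->interrupts_seen++;
}
```

Registering `model_isr` twice with two different `dev_model` objects shows the design point: the same routine is invoked once per descriptor, and dev_id is what tells the invocations apart.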

4.6.1.8 Dynamic allocation of IRQ lines

As noted in Section 4.6.1.1, a few vectors are reserved for specific devices, while the remaining ones are dynamically handled. There is, therefore, a way in which the same IRQ line can be used by several hardware devices even if they do not allow IRQ sharing. The trick is to serialize the activation of the hardware devices so that just one owns the IRQ line at a time.

Before activating a device that is going to use an IRQ line, the corresponding driver invokes request_irq( ). This function creates a new irqaction descriptor and initializes it with the parameter values; it then invokes the setup_irq( ) function to insert the descriptor in the proper IRQ list. The device driver aborts the operation if setup_irq( ) returns an error code, which means that the IRQ line is already in use by another device that does not allow interrupt sharing. When the device operation is concluded, the driver invokes the free_irq( ) function to remove the descriptor from the IRQ list and release the memory area.

Let's see how this scheme works on a simple example. Assume a program wants to address the /dev/fd0 device file, which corresponds to the first floppy disk on the system.[11] The program can do this either by directly accessing /dev/fd0 or by mounting a filesystem on it. Floppy disk controllers are usually assigned IRQ 6; given this, the floppy driver issues the following request:

[11] Floppy disks are "old" devices that do not usually allow IRQ sharing.

    request_irq(6, floppy_interrupt,
                SA_INTERRUPT|SA_SAMPLE_RANDOM, "floppy", NULL);

As can be observed, the floppy_interrupt( ) interrupt service routine must execute with the interrupts disabled (SA_INTERRUPT set) and no sharing of the IRQ (SA_SHIRQ flag cleared). The SA_SAMPLE_RANDOM flag set means that accesses to the floppy disk are a good source of random events to be used for the kernel random number generator. When the operation on the floppy disk is concluded (either the I/O operation on /dev/fd0 terminates or the filesystem is unmounted), the driver releases IRQ 6:

    free_irq(6, NULL);

To insert an irqaction descriptor in the proper list, the kernel invokes the setup_irq( ) function, passing to it the parameters irq_nr, the IRQ number, and new (the address of a previously allocated irqaction descriptor). This function:

1. Checks whether another device is already using the irq_nr IRQ and, if so, whether the SA_SHIRQ flags in the irqaction descriptors of both devices specify that the IRQ line can be shared. Returns an error code if the IRQ line cannot be used.

2. Adds *new (the new irqaction descriptor pointed to by new) at the end of the list to which irq_desc[irq_nr].action points.

3. If no other device is sharing the same IRQ, clears the IRQ_DISABLED, IRQ_AUTODETECT, and IRQ_INPROGRESS flags in the status field of the irq_desc[irq_nr] descriptor and invokes the startup method of the irq_desc[irq_nr].handler PIC object to make sure that IRQ signals are enabled.

Here is an example of how setup_irq( ) is used, drawn from system initialization. The kernel initializes the irq0 descriptor of the interval timer device by executing the following instructions in the time_init( ) function (see Chapter 6):

    struct irqaction irq0 =
        {timer_interrupt, SA_INTERRUPT, 0, "timer", NULL,};
    setup_irq(0, &irq0);

First, the irq0 variable of type irqaction is initialized: the handler field is set to the address of the timer_interrupt( ) function, the flags field is set to SA_INTERRUPT, the name field is set to "timer", and the last field is set to NULL to show that no dev_id value is used. Next, the kernel invokes setup_irq( ) to insert irq0 in the list of irqaction descriptors associated with IRQ0.
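The sharing check and the list insertion performed by setup_irq( ) can be tried out with a reduced user-space model. What follows is a sketch under several assumptions rather than the kernel implementation: `SA_SHIRQ_SKETCH`, `NR_IRQS_SKETCH`, `irq_action_sketch`, `irq_list`, and the `sketch_` functions are hypothetical names, the PIC startup step is omitted, and the removal function identifies a descriptor by its address for simplicity.

```c
#include <assert.h>
#include <stddef.h>

#define SA_SHIRQ_SKETCH 0x1u  /* hypothetical value of the "sharing allowed" flag */
#define NR_IRQS_SKETCH  16    /* assumed number of IRQ lines in the model */

/* Reduced, hypothetical version of an irqaction descriptor. */
struct irq_action_sketch {
    unsigned flags;
    const char *name;
    struct irq_action_sketch *next;
};

/* One list of descriptors per IRQ line. */
struct irq_action_sketch *irq_list[NR_IRQS_SKETCH];

/* Appends new_action to the list of irq_nr.
 * Returns 0 on success, -1 if the line is in use and cannot be shared. */
int sketch_setup_irq(unsigned irq_nr, struct irq_action_sketch *new_action)
{
    struct irq_action_sketch **p;

    assert(irq_nr < NR_IRQS_SKETCH && new_action != NULL);
    p = &irq_list[irq_nr];
    if (*p != NULL) {
        /* Line already in use: both descriptors must allow sharing. */
        if (!((*p)->flags & new_action->flags & SA_SHIRQ_SKETCH))
            return -1;
        while (*p != NULL)  /* walk to the end of the list */
            p = &(*p)->next;
    }
    new_action->next = NULL;
    *p = new_action;
    return 0;
}

/* Unlinks action from the list of irq_nr.
 * Returns 0 if it was found, -1 otherwise. */
int sketch_free_irq(unsigned irq_nr, struct irq_action_sketch *action)
{
    struct irq_action_sketch **p;

    assert(irq_nr < NR_IRQS_SKETCH);
    for (p = &irq_list[irq_nr]; *p != NULL; p = &(*p)->next) {
        if (*p == action) {
            *p = action->next;
            action->next = NULL;
            return 0;
        }
    }
    return -1;
}
```

In this model a descriptor without the sharing flag, like the floppy one in the example above, owns its line exclusively until it is released; a second device can claim the line only after `sketch_free_irq()` has run.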

4.6.2 Interprocessor Interrupt Handling

On multiprocessor systems, Linux defines the following five kinds of interprocessor interrupts (see also Table 4-2):

CALL_FUNCTION_VECTOR (vector 0xfb)
Sent to all CPUs but the sender, forcing those CPUs to run a function passed by the sender. The corresponding interrupt handler is named call_function_interrupt( ). The function passed as a parameter may, for instance, force all other CPUs to stop, or may force them to set the contents of the Memory Type Range Registers (MTRRs).[12] Usually this interrupt is sent to all CPUs except the CPU executing the calling function by means of the smp_call_function( ) facility function.

[12] Starting with the Pentium Pro model, Intel microprocessors include these additional registers to easily customize cache operations. For instance, Linux may use these registers to disable the hardware cache for the addresses mapping the frame buffer of a PCI/AGP graphics card while maintaining the "write combining" mode of operation: the paging unit combines write transfers into larger chunks before copying them into the frame buffer.

RESCHEDULE_VECTOR (vector 0xfc)
When a CPU receives this type of interrupt, the corresponding handler, named reschedule_interrupt( ), limits itself to acknowledging the interrupt. All the rescheduling is done automatically when returning from the interrupt (see Section 4.8 later in this chapter).

INVALIDATE_TLB_VECTOR (vector 0xfd)
Sent to all CPUs but the sender, forcing them to invalidate their Translation Lookaside Buffers. The corresponding handler, named invalidate_interrupt( ), flushes some TLB entries of the processor as described in Section 2.5.7.

ERROR_APIC_VECTOR (vector 0xfe)
This interrupt should never occur.

SPURIOUS_APIC_VECTOR (vector 0xff)
This interrupt should never occur.

Thanks to the following group of functions, issuing interprocessor interrupts (IPIs) becomes an easy task:

send_IPI_all( )
Sends an IPI to all CPUs (including the sender)

send_IPI_allbutself( )
Sends an IPI to all CPUs except the sender

send_IPI_self( )
Sends an IPI to the sender CPU

send_IPI_mask( )
Sends an IPI to a group of CPUs specified by a bit mask

The assembly language code of the interprocessor interrupt handlers is generated by the BUILD_SMP_INTERRUPT macro; the code is almost identical to the code generated by the BUILD_IRQ macro (see the earlier section Section 4.6.1.4). Each interprocessor interrupt has a different high-level handler, which has the same name as the low-level handler preceded by smp_. For instance, the high-level handler of the RESCHEDULE_VECTOR interprocessor interrupt that is invoked by the low-level reschedule_interrupt( ) handler is named smp_reschedule_interrupt( ). Each high-level handler acknowledges the interprocessor interrupt on the local APIC and then performs the specific action triggered by the interrupt.
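The four sending functions differ only in the set of CPUs they address. The sketch below computes that set as a bit mask so the four cases can be compared side by side. It does not program any APIC, and it is not the kernel interface: the `sketch_ipi_` functions, their parameters, and the idea of passing the set of online CPUs explicitly are assumptions made for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* In every function, bit i of a mask stands for CPU i (hypothetical model). */

/* All online CPUs, sender included. */
uint32_t sketch_ipi_all(uint32_t online)
{
    return online;
}

/* All online CPUs except the sender. */
uint32_t sketch_ipi_allbutself(uint32_t online, unsigned self)
{
    assert(self < 32);
    return online & ~(UINT32_C(1) << self);
}

/* Only the sender. */
uint32_t sketch_ipi_self(unsigned self)
{
    assert(self < 32);
    return UINT32_C(1) << self;
}

/* The CPUs named in mask; restricting the result to online CPUs
 * is an assumption of this model. */
uint32_t sketch_ipi_mask(uint32_t online, uint32_t mask)
{
    return online & mask;
}
```

On a four-CPU model with CPU 2 as the sender, `sketch_ipi_allbutself(0xF, 2)` addresses CPUs 0, 1, and 3, which is the target set used for interrupts such as CALL_FUNCTION_VECTOR and INVALIDATE_TLB_VECTOR.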


4.7 Softirqs, Tasklets, and Bottom Halves

We mentioned earlier in Section 4.6 that several tasks among those executed by the kernel are not critical: they can be deferred for a long period of time, if necessary. Remember that the interrupt service routines of an interrupt handler are serialized, and often there should be no occurrence of an interrupt until the corresponding interrupt handler has terminated. Conversely, the deferrable tasks can execute with all interrupts enabled. Taking them out of the interrupt handler helps keep kernel response time small. This is a very important property for many time-critical applications that expect their interrupt requests to be serviced in a few milliseconds.

Linux 2.4 answers such a challenge by using three kinds of deferrable and interruptible kernel functions (in short, deferrable functions[13]): softirqs, tasklets, and bottom halves. Although these three kinds of deferrable functions work in different ways, they are strictly correlated. Tasklets are implemented on top of softirqs, and bottom halves are implemented by means of tasklets. As a matter of fact, the term "softirq," which appears in the kernel source code, often denotes all kinds of deferrable functions.

[13] These are also called software interrupts, but we denote them as "deferrable functions" to avoid confusion with programmed exceptions, which are referred to as "software interrupts" in Intel manuals.

As a general rule, no softirq can be interrupted to run another softirq on the same CPU; the same rule holds for tasklets and bottom halves built on top of softirqs. On a multiprocessor system, however, several deferrable functions can run concurrently on different CPUs. The degree of concurrency depends on the type of deferrable function, as shown in Table 4-6.

Table 4-6. Differences between softirqs, tasklets, and bottom halves

Deferrable function | Dynamic allocation | Concurrency
Softirq             | No                 | Softirqs of the same type can run concurrently on several CPUs.
Tasklet             | Yes                | Tasklets of different types can run concurrently on several CPUs, but tasklets of the same type cannot.
Bottom half         | No                 | Bottom halves cannot run concurrently on several CPUs.

Softirqs and bottom halves are statically allocated (i.e., defined at compile time), while tasklets can also be allocated and initialized at runtime (for instance, when loading a kernel module).

Many softirqs can always be executed concurrently on several CPUs, even if they are of the same type. Generally speaking, softirqs are re-entrant functions and must explicitly protect their data structures with spin locks.

Tasklets differ from softirqs because a tasklet is always serialized with respect to itself; in other words, a tasklet cannot be executed by two CPUs at the same time. However, different tasklets can be executed concurrently on several CPUs. Serializing the tasklet simplifies the life of device driver developers, since the tasklet function need not be re-entrant.

Finally, bottom halves are globally serialized. When one bottom half is in execution on some CPU, the other CPUs cannot execute any bottom half, even if it is of a different type. This is quite a strong limitation, since it degrades the performance of the Linux kernel on multiprocessor systems. As a matter of fact, bottom halves continue to be supported by the kernel for compatibility reasons only, and device driver developers are expected to update their old drivers and replace bottom halves with tasklets. Therefore, it is likely that bottom halves will disappear in a future version of Linux.

In any case, deferrable functions must be executed serially. No deferrable function can be interleaved with other deferrable functions on the same CPU.

Generally speaking, four kinds of operations can be performed on deferrable functions:

Initialization
Defines a new deferrable function; this operation is usually done when the kernel initializes itself.

Activation
Marks a deferrable function as "pending", that is, to be run in the next round of executions of the deferrable functions. Activation can be done at any time (even while handling interrupts).

Masking
Selectively disables a deferrable function in such a way that it will not be executed by the kernel even if activated. We'll see in Section 5.3.11 that disabling deferrable functions is sometimes essential.

Execution
Executes a pending deferrable function together with all other pending deferrable functions of the same type; execution is performed at well-specified times, explained later in Section 4.7.1.

Activation and execution are bound together: a deferrable function that has been activated by a given CPU must be executed on the same CPU. There is no self-evident reason suggesting that this rule is beneficial for system performance. Binding the deferrable function to the activating CPU could in theory make better use of the CPU hardware cache. After all, it is conceivable that the activating kernel thread accesses some data structures that will also be used by the deferrable function. However, the relevant lines could easily be no longer in the cache when the deferrable function is run because its execution can be delayed a long time. Moreover, binding a function to a CPU is always a potentially "dangerous" operation, since a CPU might end up very busy while the others are mostly idle.

4.7.1 Softirqs

Linux 2.4 uses a limited number of softirqs. For most purposes, tasklets are good enough and are much easier to write because they do not need to be re-entrant. As a matter of fact, only the four kinds of softirqs listed in Table 4-7 are currently defined.

Table 4-7. Softirqs used in Linux 2.4

Softirq         | Index (priority) | Description
HI_SOFTIRQ      | 0                | Handles high-priority tasklets and bottom halves
NET_TX_SOFTIRQ  | 1                | Transmits packets to network cards
NET_RX_SOFTIRQ  | 2                | Receives packets from network cards
TASKLET_SOFTIRQ | 3                | Handles tasklets

The index of a softirq determines its priority: a lower index means higher priority because softirq functions will be executed starting from index 0.

The main data structure used to represent softirqs is the softirq_vec array, which includes 32 elements of type softirq_action. The priority of a softirq is the index of the corresponding softirq_action element inside the array. As shown in Table 4-7, only the first four entries of the array are effectively used. The softirq_action data structure consists of two fields: a pointer to the softirq function and a pointer to a generic data structure that may be needed by the softirq function.

The irq_stat array, already introduced in Section 4.6.1.2, includes several fields used by the kernel to implement softirqs (and also tasklets and bottom halves, which depend on softirqs). Each element of the array, corresponding to a given CPU, includes:

● A __softirq_pending field that stores a bit mask of the pending softirqs: bit n is set when the softirq with index n has been activated and not yet executed. This field may easily be accessed through the softirq_pending macro.

● A __local_bh_count field that disables the execution of the softirqs (as well as tasklets and bottom halves). This field may easily be accessed through the local_bh_count macro. If it is set to zero, the softirqs are enabled; alternatively, if the field stores a positive integer, the softirqs are disabled. The local_bh_disable macro increments the field, while the local_bh_enable macro decrements it. If the kernel invokes local_bh_disable twice, it must also call local_bh_enable twice to re-enable softirqs.[14]

[14] Better names for these two macros could be local_softirq_disable and local_softirq_enable. The actual names are vestiges of old kernel versions.

● A __ksoftirqd_task field that stores the process descriptor address of a ksoftirqd_CPUn kernel thread, which is devoted to the execution of deferrable functions. (There is one such thread per CPU, and the n in ksoftirqd_CPUn represents the CPU index, as described later in Section 4.7.1.1.) This field can be accessed through the ksoftirqd_task macro.
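Because the disabling field is a counter rather than a flag, disable and enable calls nest. The following user-space sketch shows that behavior on a per-CPU array. It is a model built on assumptions, not the kernel macros: `NR_CPUS_SKETCH`, `cpu_stat_sketch`, `cpu_stat`, and the `sketch_` functions are hypothetical names, and the check for pending softirqs that the real enable operation triggers is left out.

```c
#include <assert.h>

#define NR_CPUS_SKETCH 4  /* assumed number of CPUs in the model */

/* Reduced, hypothetical version of a per-CPU irq_stat entry. */
struct cpu_stat_sketch {
    unsigned softirq_pending;  /* bit n set: softirq n is pending */
    unsigned local_bh_count;   /* greater than zero: softirqs disabled */
};

struct cpu_stat_sketch cpu_stat[NR_CPUS_SKETCH];

/* Disables deferrable functions on one CPU; calls may nest. */
void sketch_local_bh_disable(unsigned cpu)
{
    assert(cpu < NR_CPUS_SKETCH);
    cpu_stat[cpu].local_bh_count++;
}

/* Undoes one earlier disable on the same CPU. */
void sketch_local_bh_enable(unsigned cpu)
{
    assert(cpu < NR_CPUS_SKETCH && cpu_stat[cpu].local_bh_count > 0);
    cpu_stat[cpu].local_bh_count--;
}

/* Nonzero when deferrable functions may run on the CPU. */
int sketch_softirqs_enabled(unsigned cpu)
{
    assert(cpu < NR_CPUS_SKETCH);
    return cpu_stat[cpu].local_bh_count == 0;
}
```

A counter lets a function disable softirqs without knowing whether its caller already did so: the softirqs become enabled again only when the outermost caller issues the matching enable.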

The open_softirq( ) function takes care of softirq initialization. It uses three parameters: the softirq index, a pointer to the softirq function to be executed, and a second pointer to a data structure that may be required by the softirq function. open_softirq( ) limits itself to initializing the proper entry of the softirq_vec array.

Softirqs are activated by invoking the __cpu_raise_softirq macro, which receives as parameters the CPU number cpu and the softirq index nr, and sets the nr-th bit of softirq_pending(cpu). The cpu_raise_softirq( ) function is similar to the __cpu_raise_softirq macro, except that it might also wake up the ksoftirqd_CPUn kernel thread.

Checks for pending softirqs are performed in a few points of the kernel code. Currently, this is done in the following cases (be warned that the number and position of the softirq check points change both with the kernel version and with the supported hardware architecture):

● When the local_bh_enable macro re-enables the softirqs

● When the do_IRQ( ) function finishes handling an I/O interrupt

● When the smp_apic_timer_interrupt( ) function finishes handling a local timer interrupt (see Section 6.2.2)

● When one of the special ksoftirqd_CPUn kernel threads is awoken

● When a packet is received on a network interface card (see Chapter 18)



At each checkpoint, the kernel reads softirq_pending(cpu); if this field is not zero, the kernel invokes do_softirq( ) to execute the softirq functions. It performs the following actions:

1. Gets the logical number cpu of the CPU that executes the function.

2. Returns if local_irq_count(cpu) is not set to zero. In this case, do_softirq( ) is invoked while terminating a nested interrupt handler, and we know that deferrable functions must run outside of interrupt service routines.

3. Returns if local_bh_count(cpu) is not set to zero. In this case, all deferrable functions are disabled.

4. Saves the state of the IF flag and clears it to disable local interrupts.

5. Checks the softirq_pending(cpu) field of irq_stat. If no softirqs are pending, restores the value of the IF flag saved in the previous step, and then returns.

6. Invokes local_bh_disable(cpu) to increment the local_bh_count(cpu) field of irq_stat. In this way, deferrable functions are effectively serialized on the CPU because any further invocation of do_softirq( ) returns without executing the softirq functions (see check at Step 3).

7. Executes the following loop:

       pending = softirq_pending(cpu);
       softirq_pending(cpu) = 0;
       mask = ~0;
       do {
           mask &= ~pending;
           asm("sti");
           for (i=0; pending; pending >>= 1, i++)
               if (pending & 1)
                   softirq_vec[i].action(softirq_vec+i);
           asm("cli");
           pending = softirq_pending(cpu);
       } while (pending & mask);

   As you may see, the function stores the pending softirqs in the pending local variable, and then resets the softirq_pending(cpu) field to zero. In each iteration of the loop, the function:

   a. Updates the mask local variable by clearing the bits of the softirqs that are executed in this invocation of the do_softirq( ) function; the bits still set in mask thus identify the softirqs that have not been handled yet.

   b. Enables local interrupts.

   c. Executes the softirq functions of all pending softirqs (inner loop).

   d. Disables local interrupts.

   e. Reloads the pending local variable with the contents of the softirq_pending(cpu) field. An interrupt handler, or even a softirq function, could have invoked cpu_raise_softirq( ) while softirq functions were executing.

   f. Performs another iteration of the loop if a softirq that has not been handled in this invocation of do_softirq( ) is activated.

8. Decrements the local_bh_count(cpu) field, thus re-enabling the softirqs.

9. Checks the value of the pending local variable. If it is not zero, a softirq that was handled in this invocation of do_softirq( ) is activated again. To trigger another execution of the do_softirq( ) function, the function wakes up the ksoftirqd_CPUn kernel thread.

10. Restores the status of the IF flag (local interrupts enabled or disabled) saved in Step 4 and returns.
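The steps above can be explored with a small user-space simulation. This is only a sketch under stated assumptions, not the kernel's implementation: every sim_ name is hypothetical, interrupt disabling and the local_irq_count check are omitted, mask starts with all bits set, and the pending word is cleared at the start of every pass so that a softirq raised during one pass is not counted twice.

```c
#include <stdint.h>

#define SIM_NR_SOFTIRQS 32

/* Hypothetical handler type; real softirq_vec entries also carry a data pointer. */
typedef void (*sim_action_t)(int nr);

static uint32_t sim_pending;                  /* models softirq_pending(cpu) */
static sim_action_t sim_vec[SIM_NR_SOFTIRQS]; /* models softirq_vec */
static int sim_bh_count;                      /* models local_bh_count(cpu) */
static int sim_thread_woken;                  /* set when Step 9 would wake the thread */
static int sim_runs[SIM_NR_SOFTIRQS];         /* how many times each handler ran */

/* Models __cpu_raise_softirq: set the nr-th bit of the pending word. */
static void sim_raise_softirq(int nr)
{
    sim_pending |= (uint32_t)1 << nr;
}

/* Models open_softirq: register the function for softirq nr. */
static void sim_open_softirq(int nr, sim_action_t action)
{
    sim_vec[nr] = action;
}

/* Example handlers used to exercise the loop. */
static void sim_plain(int nr)        { sim_runs[nr]++; }
static void sim_raises_other(int nr) { sim_runs[nr]++; sim_raise_softirq(3); }
static void sim_self_raising(int nr) { sim_runs[nr]++; sim_raise_softirq(nr); }

static void sim_do_softirq(void)
{
    uint32_t pending, mask;
    int i;

    if (sim_bh_count != 0)           /* Step 3: deferrable functions disabled */
        return;
    pending = sim_pending;           /* Step 5 */
    if (pending == 0)
        return;
    sim_bh_count++;                  /* Step 6 */
    mask = ~(uint32_t)0;             /* every softirq is still unhandled */
    do {
        sim_pending = 0;             /* assumption: cleared on every pass */
        mask &= ~pending;            /* 7a: mark these softirqs as handled */
        for (i = 0; pending; pending >>= 1, i++)   /* 7c */
            if ((pending & 1) && sim_vec[i])
                sim_vec[i](i);
        pending = sim_pending;       /* 7e: pick up newly raised softirqs */
    } while (pending & mask);        /* 7f: loop only for unhandled ones */
    sim_bh_count--;                  /* Step 8 */
    if (pending)                     /* Step 9: leave repeats to the thread */
        sim_thread_woken = 1;
}
```

In this model, a handler that raises a different softirq causes a second pass within the same call, while a handler that raises itself runs only once and leaves its bit pending for the kernel thread.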

4.7.1.1 The softirq kernel threads

In recent kernel versions, each CPU has its own ksoftirqd_CPUn kernel thread (where n is the logical number of the CPU). Each ksoftirqd_CPUn kernel thread runs the ksoftirqd( ) function, which essentially executes the following loop:

    for(;;) {
        set_current_state(TASK_INTERRUPTIBLE);
        schedule( );
        /* now in TASK_RUNNING state */
        while (softirq_pending(cpu)) {
            do_softirq( );
            if (current->need_resched)
                schedule( );
        }
    }

When awoken, the kernel thread checks the softirq_pending(n) field and invokes, if necessary, do_softirq( ).

The ksoftirqd_CPUn kernel threads represent a solution for a critical trade-off problem.

Softirq functions may re-activate themselves; actually, both the networking softirqs and the tasklet softirqs do this. Moreover, external events, like packet flooding on a network card, may activate softirqs at very high frequency.

The potential for a continuous high-volume flow of softirqs creates a problem that is solved by introducing kernel threads. Without them, developers are essentially faced with two alternative strategies.

The first strategy consists of ignoring new softirqs that occur while do_softirq( ) is running. In other words, the do_softirq( ) function determines what softirqs are pending when the function is started, and then executes their functions. Next, it terminates without rechecking the pending softirqs. This solution is not good enough. Suppose that a softirq function is re-activated during the execution of do_softirq( ). In the worst case, the softirq is not executed again until the next timer interrupt, even if the machine is idle. As a result, softirq latency time is unacceptable for networking developers.

The second strategy consists of continuously rechecking for pending softirqs. The do_softirq( ) function keeps checking the pending softirqs and terminates only when none of them is pending. While this solution might satisfy networking developers, it can certainly upset normal users of the system: if a high-frequency flow of packets is received by a network card or a softirq function keeps activating itself, the do_softirq( ) function never returns and the User Mode programs are virtually stopped.

The ksoftirqd_CPUn kernel threads try to solve this difficult trade-off problem. The do_softirq( ) function determines what softirqs are pending and executes their functions. If an already executed softirq is activated again, the function wakes up the kernel thread and terminates (Step 9 of do_softirq( )). The kernel thread has low priority, so user programs have a chance to run; but if the machine is idle, the pending softirqs are executed quickly.

4.7.2 Tasklets

Tasklets are the preferred way to implement deferrable functions in I/O drivers. As already explained, tasklets are built on top of two softirqs named HI_SOFTIRQ and TASKLET_SOFTIRQ. Several tasklets may be associated with the same softirq, each tasklet carrying its own function. There is no real difference between the two softirqs, except that do_softirq( ) executes HI_SOFTIRQ's tasklets before TASKLET_SOFTIRQ's tasklets.

Tasklets and high-priority tasklets are stored in the tasklet_vec and tasklet_hi_vec arrays, respectively. Both of them include NR_CPUS elements of type tasklet_head, and each element consists of a pointer to a list of tasklet descriptors. The tasklet descriptor is a data structure of type tasklet_struct, whose fields are shown in Table 4-8.

Table 4-8. The fields of the tasklet descriptor

Field name    Description
next          Pointer to next descriptor in the list
state         Status of the tasklet
count         Lock counter
func          Pointer to the tasklet function
data          An unsigned long integer that may be used by the tasklet function

The state field of the tasklet descriptor includes two flags:

TASKLET_STATE_SCHED
    When set, this indicates that the tasklet is pending (has been scheduled for execution); it also means that the tasklet descriptor is inserted in one of the lists of the tasklet_vec and tasklet_hi_vec arrays.

TASKLET_STATE_RUN
    When set, this indicates that the tasklet is being executed; on a uniprocessor system this flag is not used because there is no need to check whether a specific tasklet is running.

Let's suppose you're writing a device driver and you want to use a tasklet: what has to be done? First of all, you should allocate a new tasklet_struct data structure and initialize it by invoking tasklet_init( ); this function receives as parameters the address of the tasklet descriptor, the address of your tasklet function, and its optional integer argument.

Your tasklet may be selectively disabled by invoking either tasklet_disable_nosync( ) or tasklet_disable( ). Both functions increment the count field of the tasklet descriptor, but the latter function does not return until an already running instance of the tasklet function has terminated. To re-enable your tasklet, use tasklet_enable( ).

To activate the tasklet, you should invoke either the tasklet_schedule( ) function or the tasklet_hi_schedule( ) function, according to the priority that you require for your tasklet. The two functions are very similar; each of them performs the following actions:

1. Checks the TASKLET_STATE_SCHED flag; if it is set, returns (the tasklet has already been scheduled)

2. Gets the logical number of the CPU that is executing the function

3. Saves the state of the IF flag and clears it to disable local interrupts

4. Adds the tasklet descriptor at the beginning of the list pointed to by tasklet_vec[cpu] or tasklet_hi_vec[cpu]

5. Invokes cpu_raise_softirq( ) to activate either the TASKLET_SOFTIRQ softirq or the HI_SOFTIRQ softirq

6. Restores the value of the IF flag saved in Step 3 (local interrupts enabled or disabled)

Finally, let's see how your tasklet is executed. We know from the previous section that, once activated, softirq functions are executed by the do_softirq( ) function. The softirq function associated with the HI_SOFTIRQ softirq is named tasklet_hi_action( ), while the function associated with TASKLET_SOFTIRQ is named tasklet_action( ). Once again, the two functions are very similar; each of them:

1. Gets the logical number of the CPU that is executing the function.

2. Disables local interrupts, saving the previous state of the IF flag.

3. Stores the address of the list pointed to by tasklet_vec[cpu] or tasklet_hi_vec[cpu] in the list local variable.

4. Puts a NULL address in tasklet_vec[cpu] or tasklet_hi_vec[cpu]; thus, the list of scheduled tasklet descriptors is emptied.

5. Enables local interrupts.

6. For each tasklet descriptor in the list pointed to by list:

   a. In multiprocessor systems, checks the TASKLET_STATE_RUN flag of the tasklet. If it is set, a tasklet of the same type is already running on another CPU, so the function reinserts the tasklet descriptor in the list pointed to by tasklet_vec[cpu] or tasklet_hi_vec[cpu] and activates the TASKLET_SOFTIRQ or HI_SOFTIRQ softirq again. In this way, execution of the tasklet is deferred until no other tasklets of the same type are running on other CPUs.

   b. If the TASKLET_STATE_RUN flag is not set, the tasklet is not running on other CPUs. In multiprocessor systems, the function sets the flag so that the tasklet function cannot be executed on other CPUs.

   c. Checks whether the tasklet is disabled by looking at the count field of the tasklet descriptor. If it is disabled, it reinserts the tasklet descriptor in the list pointed to by tasklet_vec[cpu] or tasklet_hi_vec[cpu]; then the function activates the TASKLET_SOFTIRQ or HI_SOFTIRQ softirq again.

   d. If the tasklet is enabled, clears the TASKLET_STATE_SCHED flag and executes the tasklet function.

Notice that, unless the tasklet function re-activates itself, every tasklet activation triggers at most one execution of the tasklet function.
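The scheduling and execution steps above can be sketched as a uniprocessor, user-space model. The sim_ names are hypothetical and nothing here is the kernel's code: interrupt handling and the TASKLET_STATE_RUN flag are left out, and the sketch assumes that scheduling sets the TASKLET_STATE_SCHED flag after testing it.

```c
#include <stddef.h>

#define SIM_STATE_SCHED 0x1UL   /* models TASKLET_STATE_SCHED */

/* Fields follow Table 4-8; the struct name is hypothetical. */
struct sim_tasklet {
    struct sim_tasklet *next;
    unsigned long state;
    int count;                      /* nonzero means disabled */
    void (*func)(unsigned long);
    unsigned long data;
};

static struct sim_tasklet *sim_list;   /* models tasklet_vec[cpu] */
static int sim_softirq_raised;         /* models the TASKLET_SOFTIRQ pending bit */
static unsigned long sim_total;        /* state touched by the example function */

/* Example tasklet function: accumulates its data argument. */
static void sim_add(unsigned long data)
{
    sim_total += data;
}

static void sim_tasklet_init(struct sim_tasklet *t,
                             void (*func)(unsigned long), unsigned long data)
{
    t->next = NULL;
    t->state = 0;
    t->count = 0;
    t->func = func;
    t->data = data;
}

/* Models tasklet_schedule: a tasklet already pending is not queued twice. */
static void sim_tasklet_schedule(struct sim_tasklet *t)
{
    if (t->state & SIM_STATE_SCHED)
        return;
    t->state |= SIM_STATE_SCHED;    /* assumption: the flag is set here */
    t->next = sim_list;             /* insert at the head of the list */
    sim_list = t;
    sim_softirq_raised = 1;
}

/* Models tasklet_action without the multiprocessor checks. */
static void sim_tasklet_action(void)
{
    struct sim_tasklet *list = sim_list;

    sim_list = NULL;                /* empty the list of scheduled tasklets */
    sim_softirq_raised = 0;         /* this activation is being served */
    while (list) {
        struct sim_tasklet *t = list;

        list = list->next;
        if (t->count != 0) {        /* disabled: requeue and raise again */
            t->next = sim_list;
            sim_list = t;
            sim_softirq_raised = 1;
            continue;
        }
        t->state &= ~SIM_STATE_SCHED;
        t->func(t->data);
    }
}
```

Scheduling the same descriptor twice before it runs results in a single execution, and a disabled tasklet stays queued until its count returns to zero.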

4.7.3 Bottom Halves

A bottom half is essentially a high-priority tasklet that cannot be executed concurrently with any other bottom half, even if it is of a different type and on another CPU. The global_bh_lock spin lock is used to ensure that at most one bottom half is running.

Linux uses an array called the bh_base table to group all bottom halves together. It is an array of pointers to bottom halves and can include up to 32 entries, one for each type of bottom half. In practice, Linux uses about half of them; the types are listed in Table 4-9. As you can see from the table, some of the bottom halves are associated with hardware devices that are not necessarily installed in the system or that are specific to platforms besides the IBM PC compatible. But TIMER_BH, TQUEUE_BH, SERIAL_BH, and IMMEDIATE_BH still see widespread use. We describe the TQUEUE_BH and IMMEDIATE_BH bottom halves later in this chapter and the TIMER_BH bottom half in Chapter 6.

Table 4-9. The Linux bottom halves

Bottom half      Peripheral device
TIMER_BH         Timer
TQUEUE_BH        Periodic task queue
DIGI_BH          DigiBoard PC/Xe
SERIAL_BH        Serial port
RISCOM8_BH       RISCom/8
SPECIALIX_BH     Specialix IO8+
AURORA_BH        Aurora multiport card (SPARC)
ESP_BH           Hayes ESP serial card
SCSI_BH          SCSI interface
IMMEDIATE_BH     Immediate task queue
CYCLADES_BH      Cyclades Cyclom-Y serial multiport
CM206_BH         CD-ROM Philips/LMS cm206 disk
MACSERIAL_BH     Power Macintosh's serial port
ISICOM_BH        MultiTech's ISI cards

The bh_task_vec array stores 32 tasklet descriptors, one for each bottom half. During kernel initialization, these tasklet descriptors are initialized in the following way:

for (i=0; iwait); }

The up( ) function increments the count field of the *sem semaphore (at offset 0 of the semaphore structure), and then it checks whether its value is greater than 0. The increment of count and the setting of the flag tested by the following jump instruction must be atomically executed; otherwise, another kernel control path could concurrently access the field value, with disastrous results. If count is greater than 0, there was no process sleeping in the wait queue, so nothing has to be done. Otherwise, the __up( ) function is invoked so that one sleeping process is woken up.

Conversely, when a process wishes to acquire a kernel semaphore lock, it invokes the down( ) function. The implementation of down( ) is quite involved, but it is essentially equivalent to the following:

    down:
        movl $sem,%ecx
        lock; decl (%ecx);
        jns 1f
        pushl %eax
        pushl %edx
        pushl %ecx
        call __down
        popl %ecx
        popl %edx
        popl %eax
    1:

where __down( ) is the following C function:

    void __down(struct semaphore * sem)
    {
        DECLARE_WAITQUEUE(wait, current);
        current->state = TASK_UNINTERRUPTIBLE;
        add_wait_queue_exclusive(&sem->wait, &wait);
        spin_lock_irq(&semaphore_lock);
        sem->sleepers++;
        for (;;) {
            if (!atomic_add_negative(sem->sleepers-1, &sem->count)) {
                sem->sleepers = 0;
                break;
            }
            sem->sleepers = 1;
            spin_unlock_irq(&semaphore_lock);
            schedule( );
            current->state = TASK_UNINTERRUPTIBLE;
            spin_lock_irq(&semaphore_lock);
        }
        spin_unlock_irq(&semaphore_lock);
        remove_wait_queue(&sem->wait, &wait);
        current->state = TASK_RUNNING;
        wake_up(&sem->wait);
    }

The down( ) function decrements the count field of the *sem semaphore (at offset 0 of the semaphore structure), and then checks whether its value is negative. Again, the decrement and the test must be atomically executed. If count is greater than or equal to 0, the current process acquires the resource and the execution continues normally. Otherwise, count is negative and the current process must be suspended. The contents of some registers are saved on the stack, and then __down( ) is invoked.

Essentially, the __down( ) function changes the state of the current process from TASK_RUNNING to TASK_UNINTERRUPTIBLE, and puts the process in the semaphore wait queue. Before accessing other fields of the semaphore structure, the function also gets the semaphore_lock spin lock and disables local interrupts. This ensures that no process running on another CPU is able to read or modify the fields of the semaphore while the current process is updating them.

The main task of the __down( ) function is to suspend the current process until the semaphore is released. However, the way in which this is done is quite involved. To easily understand the code, keep in mind that the sleepers field of the semaphore is usually set to 0 if no process is sleeping in the wait queue of the semaphore, and it is set to 1 otherwise. Let's try to explain the code by considering a few typical cases.

MUTEX semaphore open (count equal to 1, sleepers equal to 0)

    The down macro just sets the count field to 0 and jumps to the next instruction of the main program; therefore, the __down( ) function is not executed at all.

MUTEX semaphore closed, no sleeping processes (count equal to 0, sleepers equal to 0)

    The down macro decrements count and invokes the __down( ) function with the count field set to -1 and the sleepers field set to 0. In each iteration of the loop, the function checks whether the count field is negative. (Observe that the count field is not changed by atomic_add_negative( ) because sleepers is equal to 0 when the function is invoked.)

    ● If the count field is negative, the function invokes schedule( ) to suspend the current process. The count field is still set to -1, and the sleepers field to 1. The process picks up its run subsequently inside this loop and issues the test again.

    ● If the count field is not negative, the function sets sleepers to 0 and exits from the loop. It tries to wake up another process in the semaphore wait queue (but in our scenario, the queue is now empty), and terminates holding the semaphore. On exit, both the count field and the sleepers field are set to 0, as required when the semaphore is closed but no process is waiting for it.

MUTEX semaphore closed, other sleeping processes (count equal to -1, sleepers equal to 1)

    The down macro decrements count and invokes the __down( ) function with count set to -2 and sleepers set to 1. The function temporarily sets sleepers to 2, and then undoes the decrement performed by the down macro by adding the value sleepers-1 to count. At the same time, the function checks whether count is still negative (the semaphore could have been released by the holding process right before __down( ) entered the critical region).

    ● If the count field is negative, the function resets sleepers to 1 and invokes schedule( ) to suspend the current process. The count field is still set to -1, and the sleepers field to 1.

    ● If the count field is not negative, the function sets sleepers to 0, tries to wake up another process in the semaphore wait queue, and exits holding the semaphore. On exit, the count field is set to 0 and the sleepers field to 0. The values of both fields look wrong, because there are other sleeping processes. However, consider that another process in the wait queue has been woken up. This process does another iteration of the loop; the atomic_add_negative( ) function subtracts 1 from count, restoring it to -1; moreover, before returning to sleep, the woken-up process resets sleepers to 1.
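The arithmetic on count and sleepers in these cases can be traced with a single-threaded sketch. It is a hypothetical model only (the sim_ names are not kernel symbols), with no locking, wait queue, or scheduling: each helper performs just the field updates that the corresponding kernel step performs.

```c
#include <stdbool.h>

/* Hypothetical single-threaded model of the count and sleepers fields. */
struct sim_sem {
    int count;
    int sleepers;
};

/* Models atomic_add_negative(): add i to *v, report whether the result is negative. */
static bool sim_add_negative(int i, int *v)
{
    *v += i;
    return *v < 0;
}

/* Models the down macro fast path: true means __down( ) must be invoked. */
static bool sim_down_fast(struct sim_sem *s)
{
    return --s->count < 0;
}

/* Models the entry into __down( ): one more potential sleeper. */
static void sim_down_enter(struct sim_sem *s)
{
    s->sleepers++;
}

/*
 * Models one iteration of the loop in __down( ).
 * Returns true if the caller now holds the semaphore,
 * false if it would call schedule( ) and sleep.
 */
static bool sim_down_try(struct sim_sem *s)
{
    if (!sim_add_negative(s->sleepers - 1, &s->count)) {
        s->sleepers = 0;
        return true;
    }
    s->sleepers = 1;
    return false;
}

/* Models up( ): true means __up( ) must wake a sleeping process. */
static bool sim_up(struct sim_sem *s)
{
    return ++s->count <= 0;
}
```

Starting from count equal to -1 and sleepers equal to 1, a new sim_down_fast( ) plus sim_down_enter( ) followed by sim_down_try( ) leaves the fields at -1 and 1. After sim_up( ), the first woken process gets the semaphore with both fields at 0, and the next woken process brings them back to -1 and 1 before sleeping again.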

As you may easily verify, the code properly works in all cases. Consider that the wake_up( ) function in __down( ) wakes up at most one process because the sleeping processes in the wait queue are exclusive (see Section 3.2.4).

Only exception handlers, and particularly system call service routines, can use the down( ) function. Interrupt handlers or deferrable functions must not invoke down( ), since this function suspends the process when the semaphore is busy. [5] For this reason, Linux provides the down_trylock( ) function, which may be safely used by one of the previously mentioned asynchronous functions. It is identical to down( ) except when the resource is busy. In this case, the function returns immediately instead of putting the process to sleep.

[5] Exception handlers can block on a semaphore. Linux takes special care to avoid the particular kind of race condition in which two nested kernel control paths compete for the same semaphore; if that happens, one of them waits forever because the other cannot run and free the semaphore.

A slightly different function called down_interruptible( ) is also defined. It is widely used by device drivers since it allows processes that receive a signal while being blocked on a semaphore to give up the "down" operation. If the sleeping process is woken up by a signal before getting the needed resource, the function increments the count field of the semaphore and returns the value -EINTR. On the other hand, if down_interruptible( ) runs to normal completion and gets the resource, it returns 0. The device driver may thus abort the I/O operation when the return value is -EINTR.

Finally, since processes usually find semaphores in an open state, the semaphore functions are optimized for this case. In particular, the up( ) function does not enter a critical region if the semaphore wait queue is empty; similarly, the down( ) function does not enter a critical region if the semaphore is open. Much of the complexity of the semaphore implementation is precisely due to the effort of avoiding costly instructions in the main branch of the execution flow.

5.3.7 Read/Write Semaphores

Read/write semaphores are a new feature of Linux 2.4. They are similar to the read/write spin locks described earlier in Section 5.3.4, except that waiting processes are suspended until the semaphore becomes open again.

Many kernel control paths may concurrently acquire a read/write semaphore for reading; however, any writer kernel control path must have exclusive access to the protected resource. Therefore, the semaphore can be acquired for writing only if no other kernel control path is holding it for either read or write access. Read/write semaphores improve the amount of concurrency inside the kernel and improve overall system performance.

The kernel handles all processes waiting for a read/write semaphore in strict FIFO order. Each reader or writer that finds the semaphore closed is inserted in the last position of a semaphore's wait queue list. When the semaphore is released, the processes in the first positions of the wait queue list are checked. The first process is always awoken. If it is a writer, the other processes in the wait queue continue to sleep. If it is a reader, any other reader following the first process is also woken up and gets the lock. However, readers that have been queued after a writer continue to sleep.

Each read/write semaphore is described by a rw_semaphore structure that includes the following fields:

count
    Stores two 16-bit counters. The counter in the most significant word encodes in two's complement form the sum of the number of nonwaiting writers (either 0 or 1) and the number of waiting kernel control paths. The counter in the less significant word encodes the total number of nonwaiting readers and writers.

wait_list
    Points to a list of waiting processes. Each element in this list is a rwsem_waiter structure, including a pointer to the descriptor of the sleeping process and a flag indicating whether the process wants the semaphore for reading or for writing.

wait_lock
    A spin lock used to protect the wait queue list and the rw_semaphore structure itself.

The init_rwsem( ) function initializes a rw_semaphore structure by setting the count field to 0, the wait_lock spin lock to unlocked, and wait_list to the empty list.
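The FIFO wake-up policy described for read/write semaphores can be expressed as a tiny function. This is an illustrative sketch, not the kernel's algorithm as coded: the function name and the encoding of the wait queue as a string of 'R' (reader) and 'W' (writer) characters are assumptions made for the example.

```c
#include <stddef.h>

/*
 * Given the wait queue in FIFO order as a string of 'R' and 'W',
 * return how many waiters at the front of the queue are woken
 * when the semaphore is released.
 */
static size_t sim_rwsem_wake_count(const char *queue)
{
    size_t n = 0;

    if (queue[0] == '\0')
        return 0;               /* nobody is waiting */
    if (queue[0] == 'W')
        return 1;               /* a writer needs exclusive access */
    while (queue[n] == 'R')     /* wake the leading run of readers */
        n++;
    return n;
}
```

For a queue such as "RRWR", the first two readers are woken, while the writer and the reader queued behind it continue to sleep.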

The down_read( ) and down_write( ) functions acquire the read/write semaphore for reading and writing, respectively. Similarly, the up_read( ) and up_write( ) functions release a read/write semaphore previously acquired for reading and for writing. The implementation of these four functions is long, but easy to follow because it resembles the implementation of normal semaphores; therefore, we avoid describing them.

5.3.8 Completions

Linux 2.4 also makes use of another synchronization primitive similar to semaphores: the completions. They have been introduced to solve a subtle race condition that occurs in multiprocessor systems when process A allocates a temporary semaphore variable, initializes it as closed MUTEX, passes its address to process B, and then invokes down( ) on it. Later on, process B running on a different CPU invokes up( ) on the same semaphore. However, the current implementation of up( ) and down( ) also allows them to execute concurrently on the same semaphore. Thus, process A can be woken up and destroy the temporary semaphore while process B is still executing the up( ) function. As a result, up( ) might attempt to access a data structure that no longer exists.

Of course, it is possible to change the implementation of down( ) and up( ) to forbid concurrent executions on the same semaphore. However, this change would require additional instructions, which is a bad thing to do for functions that are so heavily used.

The completion is a synchronization primitive that is specifically designed to solve this problem. The completion data structure includes a wait queue head and a flag:

    struct completion {
        unsigned int done;
        wait_queue_head_t wait;
    };

The function corresponding to up( ) is called complete( ). It receives as an argument the address of a completion data structure, sets the done field to 1, and invokes wake_up( ) to wake up the exclusive process sleeping in the wait wait queue.

The function corresponding to down( ) is called wait_for_completion( ). It receives as an argument the address of a completion data structure and checks the value of the done flag. If it is set to 1, wait_for_completion( ) terminates because complete( ) has already been executed on another CPU. Otherwise, the function adds current to the tail of the wait queue as an exclusive process and puts current to sleep in the TASK_UNINTERRUPTIBLE state. Once woken up, the function removes current from the wait queue, sets done to 0, and terminates.

The real difference between completions and semaphores is how the spin lock included in the wait queue is used. Both complete( ) and wait_for_completion( ) use this spin lock to ensure that they cannot execute concurrently, while up( ) and down( ) use it only to serialize accesses to the wait queue list.
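The behavior of complete( ) and wait_for_completion( ) described above can be approximated in user space. What follows is only a sketch under stated assumptions, not the kernel code: it assumes POSIX threads are available, a pthread mutex and condition variable stand in for the wait queue and its spin lock, and init_completion( ) is a helper written for this sketch.

```c
#include <stddef.h>
#include <pthread.h>

/* User-space stand-in for the kernel's struct completion.
 * Assumption: lock + wait replace the wait queue head and its spin lock. */
struct completion {
    unsigned int done;
    pthread_mutex_t lock;
    pthread_cond_t wait;
};

/* Helper for this sketch: start with the flag cleared. */
static void init_completion(struct completion *x)
{
    x->done = 0;
    pthread_mutex_init(&x->lock, NULL);
    pthread_cond_init(&x->wait, NULL);
}

/* Counterpart of up(): set the flag and wake one sleeper,
 * holding the lock for the whole operation. */
static void complete(struct completion *x)
{
    pthread_mutex_lock(&x->lock);
    x->done = 1;
    pthread_cond_signal(&x->wait);
    pthread_mutex_unlock(&x->lock);
}

/* Counterpart of down(): sleep until the flag is set, then clear it. */
static void wait_for_completion(struct completion *x)
{
    pthread_mutex_lock(&x->lock);
    while (!x->done)
        pthread_cond_wait(&x->wait, &x->lock);
    x->done = 0;
    pthread_mutex_unlock(&x->lock);
}
```

Because the waiter must reacquire the lock before it can return, it cannot leave wait_for_completion( ) while complete( ) is still inside its locked region. This mirrors the property the text attributes to the wait queue spin lock: the two functions never execute concurrently on the same structure.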

5.3.9 Local Interrupt Disabling

Interrupt disabling is one of the key mechanisms used to ensure that a sequence of kernel statements is treated as a critical section. It allows a kernel control path to continue executing even when hardware devices issue IRQ signals, thus providing an effective way to protect data structures that are also accessed by interrupt handlers. By itself, however, local interrupt disabling does not protect against concurrent accesses to data structures by interrupt handlers running on other CPUs, so in multiprocessor systems, local interrupt disabling is often coupled with spin locks (see Section 5.4 later in this chapter).

Interrupts can be disabled on a CPU with the cli assembly language instruction, which is yielded by the __cli( ) macro. Interrupts can be enabled on a CPU by means of the sti assembly language instruction, which is yielded by the __sti( ) macro. Recent kernel versions also define the local_irq_disable( ) and local_irq_enable( ) macros, which are equivalent respectively to __cli( ) and __sti( ), but whose names are not architecture dependent and are also much easier to understand.

When the kernel enters a critical section, it clears the IF flag of the eflags register to disable interrupts. But at the end of the critical section, often the kernel can't simply set the flag again. Interrupts can execute in nested fashion, so the kernel does not necessarily know what the IF flag was before the current control path executed. In these cases, the control path must save the old setting of the flag and restore that setting at the end.

Saving and restoring the eflags content is achieved by means of the __save_flags and __restore_flags macros, respectively. Typically, these macros are used in the following way:

    __save_flags(old);
    __cli( );
    [...]
    __restore_flags(old);

The __save_flags macro copies the content of the eflags register into the old local variable; the IF flag is then cleared by __cli( ). At the end of the critical region, the macro __restore_flags restores the original content of eflags; therefore, interrupts are enabled only if they were enabled before this control path issued the __cli( ) macro.

Recent kernel versions also define the local_irq_save( ) and local_irq_restore( ) macros, which are essentially equivalent to __save_flags( ) and __restore_flags( ), but whose names are easier to understand. (On the 80x86 architecture, local_irq_save( ) also disables local interrupts after saving the flags.)

5.3.10 Global Interrupt Disabling

Some critical kernel functions can execute on a CPU only if no interrupt handler or deferrable function is running on any other CPU. This synchronization requirement is satisfied by global interrupt disabling. A typical scenario consists of a driver that needs to reset the hardware device. Before fiddling with I/O ports, the driver performs global interrupt disabling, ensuring that no other driver will access the same ports.

As we shall see in this section, global interrupt disabling significantly lowers the system concurrency level; it is deprecated because it can be replaced by more efficient synchronization techniques.

Global interrupt disabling is performed by the cli( ) macro. On a uniprocessor system, the macro just expands into __cli( ), disabling local interrupts. On multiprocessor systems, the macro waits until all interrupt handlers and all deferrable functions in the other CPUs terminate, and then acquires the global_irq_lock spin lock. The key activities required for multiprocessor systems occur inside the __global_cli( ) function, which is called by cli( ):

    __save_flags(flags);
    if (flags & 0x00000200) /* testing IF flag: local interrupts were enabled */ {
        __cli( );
        if (!local_irq_count[smp_processor_id( )])
            get_irqlock(smp_processor_id( ));
    }

First of all, __global_cli( ) checks the value of the IF flag of the eflags register because it refuses to "promote" the disabling of a local interrupt to a global one. Deadlock conditions can easily occur if this constraint is removed and global interrupts are disabled inside a critical region protected by a spin lock. For instance, consider a spin lock that is also accessed by interrupt handlers. Before acquiring the spin lock, the kernel must disable local interrupts; otherwise, an interrupt handler could freeze waiting until the interrupted program released the spin lock. Now, suppose that a kernel control path disables local interrupts, acquires the spin lock, and then invokes cli( ). The latter macro waits until all interrupt handlers on the other CPUs terminate; however, an interrupt handler could be stuck waiting for the spin lock to be released. To avoid this kind of deadlock, __global_cli( ) refuses to run if local interrupts are already disabled before its invocation.

If cli( ) is invoked with local interrupts enabled, __global_cli( ) disables them. If cli( ) is invoked inside an interrupt service routine (i.e., the local_irq_count macro returns a value different from 0), __global_cli( ) returns without performing any further action.[6] Otherwise, __global_cli( ) invokes the get_irqlock( ) function, which acquires the global_irq_lock spin lock and waits for the termination of all interrupt handlers running on the other CPUs. Moreover, if cli( ) is not invoked by a deferrable function, get_irqlock( ) waits for the termination of all bottom halves running on the other CPUs.

[6] This case should never occur because protection against concurrent execution of interrupt handlers should be based on spin locks rather than on global interrupt disabling. In short, an interrupt service routine should never execute the cli( ) macro.

global_irq_lock differs from normal spin locks because invoking get_irqlock( ) does not freeze the CPU if it already owns the lock. In fact, the global_irq_holder variable contains the logical identifier of the CPU that is holding the lock; this value is checked by get_irqlock( ) before starting the tight loop of the spin lock.
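The owner check that distinguishes global_irq_lock from a normal spin lock can be illustrated with a small user-space model. This is a simplified sketch, not the kernel's implementation: it assumes C11 atomics, uses a plain integer as the CPU identifier, introduces a hypothetical NO_CPU marker for "nobody holds the lock", and leaves out the waiting for interrupt handlers and bottom halves on other CPUs that the real get_irqlock( ) performs.

```c
#include <stdatomic.h>

#define NO_CPU (-1)   /* hypothetical marker: the lock has no holder */

static atomic_flag global_irq_lock = ATOMIC_FLAG_INIT;
static atomic_int global_irq_holder = NO_CPU;

/* Acquire the lock on behalf of logical CPU "cpu". */
static void get_irqlock(int cpu)
{
    /* The holder is checked first: a CPU that already owns
     * the lock returns at once instead of spinning on itself. */
    if (atomic_load(&global_irq_holder) == cpu)
        return;

    while (atomic_flag_test_and_set(&global_irq_lock))
        ;   /* tight loop until the lock opens */

    atomic_store(&global_irq_holder, cpu);
}

/* Release the lock, but only if "cpu" is the current holder. */
static void release_irqlock(int cpu)
{
    if (atomic_load(&global_irq_holder) == cpu) {
        atomic_store(&global_irq_holder, NO_CPU);
        atomic_flag_clear(&global_irq_lock);
    }
}
```

With an ordinary spin lock, the second get_irqlock(0) in a sequence such as get_irqlock(0); get_irqlock(0); would spin forever; here it returns immediately because CPU 0 is recorded as the holder.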

Once cli( ) returns, no other interrupt handler on other CPUs starts running until interrupts are re-enabled by invoking the sti( ) macro. On multiprocessor systems, sti( ) invokes the __global_sti( ) function:

    cpu = smp_processor_id( );
    if (!local_irq_count[cpu])
        release_irqlock(cpu);
    __sti( );

The release_irqlock( ) function releases the global_irq_lock spin lock. Notice that similar to cli( ), the sti( ) macro invoked inside an interrupt service routine is equivalent to __sti( ) because it doesn't release the spin lock.

Linux also provides global versions of the __save_flags and __restore_flags macros, which are called save_flags and restore_flags. They save and reload, respectively, information controlling the interrupt handling for the executing CPU. As illustrated in Figure 5-3, save_flags yields an integer value that depends on three conditions; restore_flags performs actions based on the value yielded by save_flags.

Figure 5-3. Actions performed by save_flags( ) and restore_flags( )

Finally, the synchronize_irq( ) function is called when a kernel control path wishes to synchronize itself with all interrupt handlers:

    for (i = 0; i < smp_num_cpus; i++)
        if (local_irq_count(i)) {
            cli( );
            sti( );
            break;
        }

By invoking cli( ), the function acquires the global_irq_lock spin lock and then waits until all executing interrupt handlers terminate; once this is done, it reenables interrupts. The synchronize_irq( ) function is usually called by device drivers when they want to make sure that all activities carried on by interrupt handlers are over.

5.3.11 Disabling Deferrable Functions

In Section 4.7.1, we explained that deferrable functions can be executed at unpredictable times (essentially, on termination of hardware interrupt handlers). Therefore, data structures accessed by deferrable functions must be protected against race conditions.

A trivial way to forbid the execution of deferrable functions on a CPU is to disable interrupts on that CPU. Since no interrupt handler can be activated, softirq actions cannot be started asynchronously.

Globally disabling interrupts on all CPUs also disables deferrable functions on all CPUs. In fact, recall that the do_softirq( ) function refuses to execute the softirqs when the global_irq_lock spin lock is closed. When the cli( ) macro returns, the invoking kernel control path can assume that no deferrable function is in execution on any CPU, and that none are started until interrupts are globally re-enabled.

As we shall see in the next section, however, the kernel sometimes needs to disable deferrable functions without disabling interrupts. Local deferrable functions can be disabled on each CPU by setting the __local_bh_count field of the irq_stat structure associated with the CPU to a nonzero value. Recall that the do_softirq( ) function never executes the softirqs if it finds a nonzero value in this field. Moreover, tasklets and bottom halves are implemented on top of softirqs, so writing a nonzero value in the field disables the execution of all deferrable functions on a given CPU, not just softirqs.

The local_bh_disable macro increments the __local_bh_count field by 1, while the local_bh_enable macro decrements it. The kernel can thus use several nested invocations of local_bh_disable; deferrable functions will be enabled again only by the local_bh_enable macro matching the first local_bh_disable invocation.
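The nesting rule for local_bh_disable and local_bh_enable can be sketched with a per-CPU counter. The code below is a user-space model written for illustration only: struct irq_stat_model, its softirq_pending and softirqs_run fields, the explicit cpu parameter, and the value of NR_CPUS are assumptions of the sketch, and the model's do_softirq( ) merely counts how often it was allowed to run.

```c
#define NR_CPUS 2   /* assumption: size of the model system */

/* Model of the per-CPU irq_stat entry (only the fields the sketch needs). */
struct irq_stat_model {
    unsigned int local_bh_count;   /* nonzero: deferrable functions disabled */
    unsigned int softirq_pending;  /* nonzero: some softirq is waiting */
    unsigned int softirqs_run;     /* how many times pending work was run */
};

static struct irq_stat_model irq_stat[NR_CPUS];

/* Each call adds one nesting level. */
static void local_bh_disable(int cpu)
{
    irq_stat[cpu].local_bh_count++;
}

/* Each call removes one nesting level. */
static void local_bh_enable(int cpu)
{
    irq_stat[cpu].local_bh_count--;
}

/* Refuses to run softirqs while the counter is nonzero. */
static void do_softirq(int cpu)
{
    if (irq_stat[cpu].local_bh_count != 0)
        return;
    if (irq_stat[cpu].softirq_pending) {
        irq_stat[cpu].softirq_pending = 0;
        irq_stat[cpu].softirqs_run++;
    }
}
```

After two calls to local_bh_disable(0), a single local_bh_enable(0) leaves the counter at 1, so do_softirq(0) still returns without doing anything; only the second local_bh_enable(0), the one matching the first disable, lets pending work run again.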


5.4 Synchronizing Accesses to Kernel Data Structures

A shared data structure can be protected against race conditions by using some of the synchronization primitives shown in the previous section. Of course, system performance may vary considerably, depending on the kind of synchronization primitive selected. Usually, the following rule of thumb is adopted by kernel developers: always keep the concurrency level as high as possible in the system.

In turn, the concurrency level in the system depends on two main factors:

1. The number of I/O devices that operate concurrently
2. The number of CPUs that do productive work

To maximize I/O throughput, interrupts should be disabled for very short periods of time. As described in Section 4.2.1, when interrupts are disabled, IRQs issued by I/O devices are temporarily ignored by the PIC and no new activity can start on such devices.

To use CPUs efficiently, synchronization primitives based on spin locks should be avoided whenever possible. When a CPU is executing a tight instruction loop waiting for the spin lock to open, it is wasting precious machine cycles.

Let's illustrate a couple of cases in which synchronization can be achieved while still maintaining a high concurrency level.

● A shared data structure consisting of a single integer value can be updated by declaring it as an atomic_t type and by using atomic operations. An atomic operation is faster than spin locks and interrupt disabling, and it slows down only kernel control paths that concurrently access the data structure.

● Inserting an element into a shared linked list is never atomic since it consists of at least two pointer assignments. Nevertheless, the kernel can sometimes perform this insertion operation without using locks or disabling interrupts. As an example of why this works, we'll consider the case where a system call service routine (see Section 9.2) inserts new elements in a simply linked list, while an interrupt handler or deferrable function asynchronously looks up the list.

In the C language, insertion is implemented by means of the following pointer assignments:

    new->next = list_element->next;
    list_element->next = new;

In assembly language, insertion reduces to two consecutive atomic instructions. The first instruction sets up the next pointer of the new element, but it does not modify the list. Thus, if the interrupt handler sees the list between the execution of the first and second instructions, it sees the list without the new element. If the handler sees the list after the execution of the second instruction, it sees the list with the new element. The important point is that in either case, the list is consistent and in an uncorrupted state. However, this integrity is assured only if the interrupt handler does not modify the list. If it does, the next pointer that was just set within the new element might become invalid.

However, developers must ensure that the order of the two assignment operations cannot be subverted by the compiler or the CPU's control unit; otherwise, if the system call service routine is interrupted by the interrupt handler between the two assignments, the handler finds a corrupted list. Therefore, a write memory barrier primitive is required:

    new->next = list_element->next;
    wmb( );
    list_element->next = new;

5.4.1 Choosing Among Spin Locks, Semaphores, and Interrupt Disabling

Unfortunately, access patterns to most kernel data structures are a lot more complex than the simple examples just shown, and kernel developers are forced to use semaphores, spin locks, interrupt disabling, and softirq disabling. Generally speaking, choosing the synchronization primitives depends on what kinds of kernel control paths access the data structure, as shown in Table 5-6.

Table 5-6. Protection required by data structures accessed by kernel control paths

Kernel control paths accessing the data structure   UP protection               MP further protection
Exceptions                                          Semaphore                   None
Interrupts                                          Local interrupt disabling   Spin lock
Deferrable functions                                None                        None or spin lock (see Table 5-8)
Exceptions + Interrupts                             Local interrupt disabling   Spin lock
Exceptions + Deferrable functions                   Local softirq disabling     Spin lock
Interrupts + Deferrable functions                   Local interrupt disabling   Spin lock
Exceptions + Interrupts + Deferrable functions      Local interrupt disabling   Spin lock

Notice that global interrupt disabling does not appear in the table. Delaying interrupts on all CPUs significantly lowers the system concurrency level, so global interrupt disabling is usually deprecated and should be replaced by other synchronization techniques. As a matter of fact, this synchronization technique is still available in Linux 2.4 to support old device drivers; it has been removed from the Linux 2.5 current development version.

5.4.1.1 Protecting a data structure accessed by exceptions

When a data structure is accessed only by exception handlers, race conditions are usually easy to understand and prevent. The most common exceptions that give rise to synchronization problems are the system call service routines (see Section 9.2) in which the CPU operates in Kernel Mode to offer a service to a User Mode program. Thus, a data structure accessed only by an exception usually represents a resource that can be assigned to one or more processes.

Race conditions are avoided through semaphores because these primitives allow the process to sleep until the resource becomes available. Notice that semaphores work equally well both in uniprocessor and multiprocessor systems.

5.4.1.2 Protecting a data structure accessed by interrupts

Suppose that a data structure is accessed by only the "top half" of an interrupt handler. We learned in Section 4.6 that each interrupt handler is serialized with respect to itself; that is, it cannot execute more than once concurrently. Thus, accessing the data structure does not require any synchronization primitive.

Things are different, however, if the data structure is accessed by several interrupt handlers. A handler may interrupt another handler, and different interrupt handlers may run concurrently in multiprocessor systems. Without synchronization, the shared data structure might easily become corrupted.

In uniprocessor systems, race conditions must be avoided by disabling interrupts in all critical regions of the interrupt handler. Nothing less will do because no other synchronization primitives accomplish the job. A semaphore can block the process, so it cannot be used in an interrupt handler. A spin lock, on the other hand, can freeze the system: if the handler accessing the data structure is interrupted, it cannot release the lock; therefore, the new interrupt handler keeps waiting on the tight loop of the spin lock.

Multiprocessor systems, as usual, are even more demanding. Race conditions cannot be avoided by simply disabling local interrupts. In fact, even if interrupts are disabled on a CPU, interrupt handlers can still be executed on the other CPUs. The most convenient method to prevent the race conditions is to disable local interrupts (so that other interrupt handlers running on the same CPU won't interfere) and to acquire a spin lock or a read/write spin lock that protects the data structure. Notice that these additional spin locks cannot freeze the system because even if an interrupt handler finds the lock closed, eventually the interrupt handler on the other CPU that owns the lock will release it.

The Linux kernel uses several macros that couple local interrupt enabling/disabling with spin lock handling. Table 5-7 describes all of them. In uniprocessor systems, these macros just enable or disable local interrupts because the spin lock handling macros do nothing.

Table 5-7. Interrupt-aware spin lock macros

Function                        Description
spin_lock_irq(l)                local_irq_disable( ); spin_lock(l)
spin_unlock_irq(l)              spin_unlock(l); local_irq_enable( )
spin_lock_irqsave(l,f)          local_irq_save(f); spin_lock(l)
spin_unlock_irqrestore(l,f)     spin_unlock(l); local_irq_restore(f)
read_lock_irq(l)                local_irq_disable( ); read_lock(l)
read_unlock_irq(l)              read_unlock(l); local_irq_enable( )
write_lock_irq(l)               local_irq_disable( ); write_lock(l)
write_unlock_irq(l)             write_unlock(l); local_irq_enable( )
read_lock_irqsave(l,f)          local_irq_save(f); read_lock(l)
read_unlock_irqrestore(l,f)     read_unlock(l); local_irq_restore(f)
write_lock_irqsave(l,f)         local_irq_save(f); write_lock(l)
write_unlock_irqrestore(l,f)    write_unlock(l); local_irq_restore(f)
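To see why the irqsave/irqrestore pairs in Table 5-7 nest correctly, it may help to model them in user space. The sketch below is illustrative only and rests on several assumptions: a static variable stands in for the eflags register of a single CPU, spinlock_model_t is a hypothetical uncontended lock, local_irq_save( ) is taken to save the flags and then clear IF, and the functions take a pointer to the saved flags where the kernel macros take the variable name directly.

```c
#define IF_FLAG 0x00000200UL   /* IF bit of eflags, as tested by __global_cli( ) */

/* Modeled eflags register of one CPU; interrupts start out enabled. */
static unsigned long eflags = IF_FLAG;

/* Hypothetical single-CPU lock: it is never contended in this model. */
typedef struct { int locked; } spinlock_model_t;

static int irqs_enabled(void)
{
    return (eflags & IF_FLAG) != 0;
}

/* Save the current flags, then disable local interrupts. */
static void local_irq_save(unsigned long *flags)
{
    *flags = eflags;
    eflags &= ~IF_FLAG;
}

/* Put back exactly the state that was saved. */
static void local_irq_restore(unsigned long flags)
{
    eflags = flags;
}

/* Same order as Table 5-7: local_irq_save(f); spin_lock(l) */
static void spin_lock_irqsave(spinlock_model_t *l, unsigned long *flags)
{
    local_irq_save(flags);
    l->locked = 1;
}

/* Same order as Table 5-7: spin_unlock(l); local_irq_restore(f) */
static void spin_unlock_irqrestore(spinlock_model_t *l, unsigned long flags)
{
    l->locked = 0;
    local_irq_restore(flags);
}
```

If a second lock is taken while the first is held, the inner spin_unlock_irqrestore( ) restores a flags value in which IF was already clear, so interrupts stay disabled until the outer lock is released. A spin_unlock_irq( ) in the inner position would instead re-enable interrupts too early, which is why the save/restore variants are preferred when the caller's interrupt state is not known.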

5.4.1.3 Protecting a data structure accessed by deferrable functions

What kind of protection is required for a data structure accessed only by deferrable functions? Well, it mostly depends on the kind of deferrable function. In Section 4.7, we explained that softirqs, tasklets, and bottom halves essentially differ in their degree of concurrency.

First of all, no race condition may exist in uniprocessor systems. This is because execution of deferrable functions is always serialized on a CPU; that is, a deferrable function cannot be interrupted by another deferrable function. Therefore, no synchronization primitive is ever required.

Conversely, in multiprocessor systems, race conditions do exist because several deferrable functions may run concurrently. Table 5-8 lists all possible cases.

Table 5-8. Protection required by data structures accessed by deferrable functions in SMP

Deferrable functions accessing the data structure   Protection
Softirqs                                            Spin lock
One tasklet                                         None
Many tasklets                                       Spin lock
Bottom halves                                       None

A data structure accessed by a softirq must always be protected, usually by means of a spin lock, because the same softirq may run concurrently on two or more CPUs. Conversely, a data structure accessed by just one kind of tasklet need not be protected, because tasklets of the same kind cannot run concurrently. However, if the data structure is accessed by several kinds of tasklets, then it must be protected. Finally, a data structure accessed only by bottom halves need not be protected because bottom halves never run concurrently.

Notice that it is also possible to prevent the race conditions by globally disabling the deferrable functions by means of the cli( ) macro. However, this should be avoided because the macro also disables the execution of interrupt handlers on all CPUs of the system.

5.4.1.4 Protecting a data structure accessed by exceptions and interrupts

Let's consider now a data structure that is accessed both by exceptions (for instance, system call service routines) and interrupt handlers.

On uniprocessor systems, race condition prevention is quite simple because interrupt handlers are not re-entrant and cannot be interrupted by exceptions. So long as the kernel accesses the data structure with local interrupts disabled, the kernel cannot be interrupted when accessing the data structure. However, if the data structure is accessed by just one kind of interrupt handler, the interrupt handler can freely access the data structure without disabling local interrupts.

On multiprocessor systems, we have to take care of concurrent executions of exceptions and interrupts on other CPUs. Local interrupt disabling must be coupled with a spin lock, which forces the concurrent kernel control paths to wait until the handler accessing the data structure finishes its work.

Sometimes it might be preferable to replace the spin lock with a semaphore. Since interrupt handlers cannot be suspended, they must acquire the semaphore using a tight loop and the down_trylock( ) function; for them, the semaphore acts essentially as a spin lock. System call service routines, on the other hand, may suspend the calling processes when the semaphore is busy. For most system calls, this is the expected behavior; it is preferable because it increases the degree of concurrency of the system.

5.4.1.5 Protecting a data structure accessed by exceptions and deferrable functions

A data structure accessed both by exception handlers and deferrable functions can be treated like a data structure accessed by exception and interrupt handlers. In fact, deferrable functions are essentially activated by interrupt occurrences, and no exception can be raised while a deferrable function is running. Coupling local interrupt disabling with a spin lock is therefore sufficient.

Actually, this is much more than sufficient: the exception handler can simply disable deferrable functions instead of local interrupts by using the local_bh_disable( ) macro (see Section 4.7.1). Disabling only the deferrable functions is preferable to disabling interrupts because interrupts continue to be serviced on the CPU. Execution of deferrable functions on each CPU is serialized, so no race condition exists.

As usual, in multiprocessor systems, spin locks are required to ensure that the data structure is accessed at any time by just one kernel control path. [7]

[7] The spin lock is required even when the data structure is accessed only by exception handlers and bottom halves (see Section 4.7). The kernel ensures that two bottom halves never run concurrently, but this is not enough to prevent race conditions.

5.4.1.6 Protecting a data structure accessed by interrupts and deferrable functions

This case is similar to that of a data structure accessed by interrupt and exception handlers. An interrupt might be raised while a deferrable function is running, but no deferrable function can stop an interrupt handler. Therefore, race conditions must be avoided by disabling local interrupts. However, an interrupt handler can freely touch the data structure accessed by the deferrable function without disabling interrupts, provided that no other interrupt handler accesses that data structure.

Again, in multiprocessor systems, a spin lock is always required to forbid concurrent accesses to the data structure on several CPUs.

5.4.1.7 Protecting a data structure accessed by exceptions, interrupts, and deferrable functions

Similarly to previous cases, disabling local interrupts and acquiring a spin lock is almost always necessary to avoid race conditions. Notice that there is no need to explicitly disable deferrable functions because they are essentially activated when terminating the execution of interrupt handlers; disabling local interrupts is therefore sufficient.


5.5 Examples of Race Condition Prevention

Kernel developers are expected to identify and solve the synchronization problems raised by interleaving kernel control paths. However, avoiding race conditions is a hard task because it requires a clear understanding of how the various components of the kernel interact. To give a feeling of what's really inside the kernel code, let's mention a few typical usages of the synchronization primitives defined in this chapter.

5.5.1 Reference Counters

Reference counters are widely used inside the kernel to avoid race conditions due to the concurrent allocation and releasing of a resource. A reference counter is just an atomic_t counter associated with a specific resource like a memory page, a module, or a file. The counter is atomically incremented when a kernel control path starts using the resource, and it is decremented when a kernel control path finishes using the resource. When the reference counter becomes zero, the resource is not being used, and it can be released if necessary.
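The get/put discipline just described can be sketched in user space with C11 atomics. This is only an illustration, not kernel code: the struct resource type and the resource_init( ), resource_get( ), and resource_put( ) helpers are hypothetical names, and the C11 atomic_int stands in for the kernel's atomic_t.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical resource guarded by a reference counter. */
struct resource {
    atomic_int refcount;   /* number of control paths using the resource */
    bool released;         /* set once the last user has gone away */
};

/* The creator of the resource is its first user. */
static void resource_init(struct resource *r)
{
    atomic_init(&r->refcount, 1);
    r->released = false;
}

/* A control path starts using the resource. */
static void resource_get(struct resource *r)
{
    atomic_fetch_add(&r->refcount, 1);
}

/* A control path finishes using the resource.
 * Returns true if this call dropped the last reference. */
static bool resource_put(struct resource *r)
{
    /* atomic_fetch_sub returns the value held before the subtraction,
     * so 1 means the counter has just become zero. */
    if (atomic_fetch_sub(&r->refcount, 1) == 1) {
        r->released = true;   /* stand-in for actually freeing the resource */
        return true;
    }
    return false;
}
```

Because the decrement and the test for zero happen in a single atomic operation, two control paths dropping their references at the same time cannot both conclude that they were the last user.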

5.5.2 The Global Kernel Lock

In earlier Linux kernel versions, a global kernel lock (also known as big kernel lock, or BKL) was widely used. In Version 2.0, this lock was a relatively crude spin lock, ensuring that only one processor at a time could run in Kernel Mode. The 2.2 kernel was considerably more flexible and no longer relied on a single spin lock; rather, a large number of kernel data structures were protected by specialized spin locks. The global kernel lock, on the other hand, was still present because splitting a big lock into several smaller locks is not trivial: both deadlocks and race conditions must be carefully avoided. Several unrelated parts of the kernel code were still serialized by the global kernel lock.

Linux kernel Version 2.4 reduces still further the role of the global kernel lock. In the current stable version, the global kernel lock is mostly used to serialize accesses to the Virtual File System and avoid race conditions when loading and unloading kernel modules. The main progress with respect to the earlier stable version is that the networking transfers and file accessing (like reading or writing into a regular file) are no longer serialized by the global kernel lock.

The global kernel lock is a spin lock named kernel_flag. Every process descriptor includes a lock_depth field, which allows the same process to acquire the global kernel lock several times. Therefore, two consecutive requests for it will not hang the processor (as for normal spin locks). If the process does not want the lock, the field has the value -1. If the process wants it, the field value plus 1 specifies how many times the lock has been requested. The lock_depth field is crucial for interrupt handlers, exception handlers, and bottom halves. Without it, any asynchronous function that tries to get the global kernel lock could generate a deadlock if the current process already owns the lock.

The lock_kernel( ) and unlock_kernel( ) functions are used to get and release the global kernel lock. The former function is equivalent to:

if (++current->lock_depth == 0) spin_lock(&kernel_flag);

while the latter is equivalent to:

if (--current->lock_depth < 0) spin_unlock(&kernel_flag);

Notice that the if statements of the lock_kernel( ) and unlock_kernel( ) functions need not be executed atomically because lock_depth is not a global variable: each CPU addresses a field of its own current process descriptor. Local interrupts inside the if statements do not induce race conditions either. Even if the new kernel control path invokes lock_kernel( ), it must release the global kernel lock before terminating.
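To see how the lock_depth counter makes nested requests harmless, the two if statements can be modeled in a single-threaded user-space sketch. Everything here is a stand-in chosen for illustration: struct task, kernel_flag_held, spin_acquisitions, and the model_ functions are hypothetical, and a plain boolean replaces the real kernel_flag spin lock.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel objects described in the text. */
struct task {
    int lock_depth;               /* -1 means "does not hold the lock" */
};

static bool kernel_flag_held;     /* models the kernel_flag spin lock */
static int  spin_acquisitions;    /* how many times the real lock was taken */

static void model_lock_kernel(struct task *cur)
{
    /* Only the first request (depth goes from -1 to 0) takes the lock. */
    if (++cur->lock_depth == 0) {
        kernel_flag_held = true;
        spin_acquisitions++;
    }
}

static void model_unlock_kernel(struct task *cur)
{
    /* Only the last release (depth goes back to -1) drops the lock. */
    if (--cur->lock_depth < 0)
        kernel_flag_held = false;
}
```

Starting from lock_depth equal to -1, two nested calls to model_lock_kernel( ) leave the depth at 1 while the underlying lock is taken only once, and the lock is dropped only by the second matching call to model_unlock_kernel( ).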

5.5.3 Memory Descriptor Read/Write Semaphore

Each memory descriptor of type mm_struct includes its own semaphore in the mmap_sem field (see Section 8.2). The semaphore protects the descriptor against race conditions that could arise because a memory descriptor can be shared among several lightweight processes.

For instance, let's suppose that the kernel must create or extend a memory region for some process; to do this, it invokes the do_mmap( ) function, which allocates a new vm_area_struct data structure. In doing so, the current process could be suspended if no free memory is available, and another process sharing the same memory descriptor could run. Without the semaphore, any operation of the second process that requires access to the memory descriptor (for instance, a Page Fault due to a Copy on Write) could lead to severe data corruption.

The semaphore is implemented as a read/write semaphore because some kernel functions, such as the Page Fault exception handler (see Section 8.4), need only to scan the memory descriptors.

5.5.4 Slab Cache List Semaphore

The list of slab cache descriptors (see Section 7.2.2) is protected by the cache_chain_sem semaphore, which grants an exclusive right to access and modify the list.

A race condition is possible when kmem_cache_create( ) adds a new element in the list, while kmem_cache_shrink( ) and kmem_cache_reap( ) sequentially scan the list. However, these functions are never invoked while handling an interrupt, and they can never block while accessing the list. Since the kernel is nonpreemptive, these functions cannot overlap on a uniprocessor system. However, this semaphore plays an active role in multiprocessor systems.

5.5.5 Inode Semaphore

As we shall see in Chapter 12, Linux stores the information on a disk file in a memory object called an inode. The corresponding data structure includes its own semaphore in the i_sem field.

A huge number of race conditions can occur during filesystem handling. Indeed, each file on disk is a resource held in common for all users, since all processes may (potentially) access the file content, change its name or location, destroy or duplicate it, and so on. For example, let's suppose that a process lists the files contained in some directory. Each disk operation is potentially blocking, and therefore even in uniprocessor systems, other processes could access the same directory and modify its content while the first process is in the middle of the listing operation. Or, again, two different processes could modify the same directory at the same time. All these race conditions are avoided by protecting the directory file with the inode semaphore.

Whenever a program uses two or more semaphores, the potential for deadlock is present because two different paths could end up waiting for each other to release a semaphore. Generally speaking, Linux has few problems with deadlocks on semaphore requests, since each kernel control path usually needs to acquire just one semaphore at a time. However, in a couple of cases, the kernel must get two semaphore locks. Inode semaphores are prone to this scenario; for instance, this occurs in the service routines of the rmdir( ) and the rename( ) system calls (notice that in both cases two different inodes are involved in the operation, so both semaphores must be taken). To avoid such deadlocks, semaphore requests are performed in address order: the semaphore request whose semaphore data structure is located at the lowest address is issued first.
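The address-ordering rule can be illustrated in user space. The sketch below uses POSIX mutexes as a stand-in for kernel semaphores, which is an assumption made only so the example can run outside the kernel; lower_address( ), lock_pair( ), and unlock_pair( ) are hypothetical helper names.

```c
#include <pthread.h>
#include <stdint.h>

/* Return whichever of the two locks lives at the lower address.
 * The pointers are compared as integers because relational comparison
 * of pointers to unrelated objects is not defined in C. */
static pthread_mutex_t *lower_address(pthread_mutex_t *a, pthread_mutex_t *b)
{
    return ((uintptr_t)a < (uintptr_t)b) ? a : b;
}

/* Acquire both locks, always taking the lower-addressed one first,
 * so that every caller agrees on the acquisition order. */
static void lock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
    pthread_mutex_t *first = lower_address(a, b);
    pthread_mutex_t *second = (first == a) ? b : a;

    pthread_mutex_lock(first);
    if (second != first)          /* same lock passed twice: take it once */
        pthread_mutex_lock(second);
}

/* Release both locks; the release order does not matter for deadlocks. */
static void unlock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
    pthread_mutex_unlock(a);
    if (b != a)
        pthread_mutex_unlock(b);
}
```

With this helper, one path calling lock_pair(x, y) and another calling lock_pair(y, x) both request the lower-addressed lock first, so neither can hold one lock while waiting for a path that holds the other.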


Chapter 6. Timing Measurements

Countless computerized activities are driven by timing measurements, often behind the user's back. For instance, if the screen is automatically switched off after you have stopped using the computer's console, it is due to a timer that allows the kernel to keep track of how much time has elapsed since you pushed a key or moved the mouse. If you receive a warning from the system asking you to remove a set of unused files, it is the outcome of a program that identifies all user files that have not been accessed for a long time. To do these things, programs must be able to retrieve a timestamp identifying its last access time from each file. Such a timestamp must be automatically written by the kernel. More significantly, timing drives process switches along with even more visible kernel activities like checking for time-outs.

We can distinguish two main kinds of timing measurement that must be performed by the Linux kernel:

● Keeping the current time and date so they can be returned to user programs through the time( ), ftime( ), and gettimeofday( ) system calls (see Section 6.7.1 later in this chapter) and used by the kernel itself as timestamps for files and network packets

● Maintaining timers, that is, mechanisms that are able to notify the kernel (see Section 6.6 later in this chapter) or a user program (see Section 6.7.3 later in this chapter) that a certain interval of time has elapsed

Timing measurements are performed by several hardware circuits based on fixed-frequency oscillators and counters. This chapter consists of four different parts. The first two sections describe the hardware devices that underlie timing and give an overall picture of Linux timekeeping architecture. The following sections describe the main time-related duties of the kernel: implementing CPU time sharing, updating system time and resource usage statistics, and maintaining software timers. The last section discusses the system calls related to timing measurements and the corresponding service routines.


6.1 Hardware Clocks

On the 80 x 86 architecture, the kernel must explicitly interact with four kinds of clocks: the Real Time Clock, the Time Stamp Counter, the Programmable Interval Timer, and the timer of the local APICs in SMP systems. The first two hardware devices allow the kernel to keep track of the current time of day. The PIT device and the timers of the local APICs are programmed by the kernel so that they issue interrupts at a fixed, predefined frequency; such periodic interrupts are crucial for implementing the timers used by the kernel and the user programs.

6.1.1 Real Time Clock

All PCs include a clock called Real Time Clock (RTC), which is independent of the CPU and all other chips.

The RTC continues to tick even when the PC is switched off, since it is energized by a small battery or accumulator. The CMOS RAM and RTC are integrated in a single chip (the Motorola 146818 or an equivalent).

The RTC is capable of issuing periodic interrupts on IRQ 8 at frequencies ranging between 2 Hz and 8,192 Hz. It can also be programmed to activate the IRQ 8 line when the RTC reaches a specific value, thus working as an alarm clock.

Linux uses the RTC only to derive the time and date; however, it allows processes to program the RTC by acting on the /dev/rtc device file (see Chapter 13). The kernel accesses the RTC through the 0x70 and 0x71 I/O ports. The system administrator can set up the clock by executing the clock Unix system program that acts directly on these two I/O ports.

6.1.2 Time Stamp Counter

All 80 x 86 microprocessors include a CLK input pin, which receives the clock signal of an external oscillator.

Starting with the Pentium, many recent 80 x 86 microprocessors include a 64-bit Time Stamp Counter (TSC) register that can be read by means of the rdtsc assembly language instruction. This register is a counter that is incremented at each clock signal. If, for instance, the clock ticks at 400 MHz, the Time Stamp Counter is incremented once every 2.5 nanoseconds.

Linux takes advantage of this register to get much more accurate time measurements than those delivered by the Programmable Interval Timer. To do this, Linux must determine the clock signal frequency while initializing the system. In fact, since this frequency is not declared when compiling the kernel, the same kernel image may run on CPUs whose clocks may tick at any frequency.

The task of figuring out the actual frequency of a CPU is accomplished during the system's boot. The calibrate_tsc( ) function computes the frequency by counting the number of clock signals that occur in a relatively long time interval, namely 50.00077 milliseconds. This time constant is produced by properly setting up one of the channels of the Programmable Interval Timer (see the next section). The long execution time of calibrate_tsc( ) does not create problems, since the function is invoked only during system initialization. [1]

[1] To avoid losing significant digits in the integer divisions, calibrate_tsc( ) returns the duration, in microseconds, of a clock tick multiplied by 2^32.

6.1.3 Programmable Interval Timer

Besides the Real Time Clock and the Time Stamp Counter, IBM-compatible PCs include another type of time-measuring device called Programmable Interval Timer (PIT). The role of a PIT is similar to the alarm clock of a microwave oven: it makes the user aware that the cooking time interval has elapsed. Instead of ringing a bell, this device issues a special interrupt called timer interrupt, which notifies the kernel that one more time interval has elapsed. [2] Another difference from the alarm clock is that the PIT goes on issuing interrupts forever at some fixed frequency established by the kernel. Each IBM-compatible PC includes at least one PIT, which is usually implemented by an 8254 CMOS chip using the 0x40-0x43 I/O ports.

[2] The PIT is also used to drive an audio amplifier connected to the computer's internal speaker.

As we shall see in detail in the next paragraphs, Linux programs the PIT to issue timer interrupts on the IRQ0 at a (roughly) 100-Hz frequency, that is, once every 10 milliseconds. This time interval is called a tick, and its length in microseconds is stored in the tick variable. The ticks beat time for all activities in the system; in some sense, they are like the ticks sounded by a metronome while a musician is rehearsing.

Generally speaking, shorter ticks result in higher resolution timers, which help with smoother multimedia playback and faster response time when performing synchronous I/O multiplexing (poll( ) and select( ) system calls). This is a trade-off, however: shorter ticks require the CPU to spend a larger fraction of its time in Kernel Mode, that is, a smaller fraction of time in User Mode. As a consequence, user programs run slower. Therefore, only very powerful machines can adopt very short ticks and afford the consequent overhead. Currently, most ports of the Linux kernel to Hewlett-Packard's Alpha and Intel's IA-64 issue 1,024 timer interrupts per second, corresponding to a tick of roughly 1 millisecond. The Rawhide Alpha station adopts the highest tick frequency and issues 1,200 timer interrupts per second.

A few macros in the Linux code yield some constants that determine the frequency of timer interrupts. These are discussed in the following list.

● HZ yields the number of timer interrupts per second, that is, their frequency. This value is set to 100 for IBM PCs and most other hardware platforms.

● CLOCK_TICK_RATE yields the value 1,193,180, which is the 8254 chip's internal oscillator frequency.

● LATCH yields the ratio between CLOCK_TICK_RATE and HZ. It is used to program the PIT.

The first PIT is initialized by init_IRQ( ) as follows:

outb_p(0x34,0x43);
outb_p(LATCH & 0xff , 0x40);
outb(LATCH >> 8 , 0x40);

The outb( ) C function is equivalent to the outb assembly language instruction: it copies the first operand into the I/O port specified as the second operand. The outb_p( ) function is similar to outb( ), except that it introduces a pause by executing a no-op instruction. The first outb_p( ) invocation is a command to the PIT to issue interrupts at a new rate. The next two outb_p( ) and outb( ) invocations supply the new interrupt rate to the device. The 16-bit LATCH constant is sent to the 8-bit 0x40 I/O port of the device as two consecutive bytes. As a result, the PIT issues timer interrupts at a (roughly) 100-Hz frequency (that is, once every 10 ms).
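The arithmetic behind the LATCH constant and the two bytes written to port 0x40 can be worked through in a small user-space sketch. The values of CLOCK_TICK_RATE and HZ come from the text; rounding the ratio to the nearest integer is an assumption of this sketch, and the helper names latch_low_byte( ), latch_high_byte( ), and actual_millihertz( ) are hypothetical. No I/O port is touched.

```c
#include <stdint.h>

#define CLOCK_TICK_RATE 1193180u   /* 8254 oscillator frequency in Hz (from the text) */
#define HZ              100u       /* timer interrupts per second (from the text) */

/* Assumption: the ratio is rounded to the nearest integer, not truncated. */
#define LATCH ((CLOCK_TICK_RATE + HZ / 2) / HZ)

/* Low byte of the divisor, the first value sent to port 0x40. */
static uint8_t latch_low_byte(void)
{
    return LATCH & 0xff;
}

/* High byte of the divisor, the second value sent to port 0x40. */
static uint8_t latch_high_byte(void)
{
    return LATCH >> 8;
}

/* Interrupt frequency actually produced by the divisor, in millihertz. */
static uint32_t actual_millihertz(void)
{
    return (uint32_t)((uint64_t)CLOCK_TICK_RATE * 1000u / LATCH);
}
```

Under these assumptions the divisor is 11,932 (0x2E9C), sent as the bytes 0x9C and then 0x2E, and the resulting rate is about 99.998 Hz, which is why the text describes the frequency as roughly 100 Hz.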

6.1.4 CPU Local Timers

The local APIC present in recent Intel processors (see Section 4.2) provides yet another time-measuring device: the CPU local timer.

The CPU local timer is a device that can issue one-shot or periodic interrupts, which is similar to the Programmable Interval Timer just described. There are, however, a few differences:

● The APIC's timer counter is 32 bits long, while the PIT's timer counter is 16 bits long; therefore, the local timer can be programmed to issue interrupts at very low frequencies (the counter stores the number of ticks that must elapse before the interrupt is issued).

● The local APIC timer sends an interrupt only to its processor, while the PIT raises a global interrupt, which may be handled by any CPU in the system.

● The APIC's timer is based on the bus clock signal (or the APIC bus signal, in older machines). It can be programmed in such a way to decrement the timer counter every 1, 2, 4, 8, 16, 32, 64, or 128 bus clock signals. Conversely, the PIT has its own internal clock oscillator.
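The effect of the counter width on the lowest programmable frequency can be estimated with a short calculation. The sketch below is only an approximation under stated assumptions: it treats an n-bit counter as able to count 2^n ticks, the 100 MHz bus clock used in the usage note is a hypothetical value, and max_interval_us( ) is a name introduced here for illustration.

```c
#include <stdint.h>

/* Longest interval, in microseconds, that a down-counter of the given
 * width can measure when it is decremented once every `divider` cycles
 * of a clock running at `clock_hz`.
 * Assumption: an n-bit counter can count 2^n ticks.
 * The product stays within 64 bits for the small inputs used here. */
static uint64_t max_interval_us(unsigned counter_bits, uint32_t divider,
                                uint64_t clock_hz)
{
    uint64_t max_ticks = (uint64_t)1 << counter_bits;

    return max_ticks * divider * 1000000u / clock_hz;
}
```

For the 16-bit PIT driven at 1,193,180 Hz, max_interval_us(16, 1, 1193180) gives about 54,925 microseconds, so the PIT cannot interrupt less often than roughly 18 times per second. For a 32-bit counter with the divider set to 128 and an assumed 100 MHz bus clock, max_interval_us(32, 128, 100000000) gives about 5,497 seconds between interrupts.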

Now that we understand what the hardware timers are, we may discuss how the Linux kernel exploits them to conduct all activities of the system.


6.2 The Linux Timekeeping Architecture

Linux must carry on several time-related activities. For instance, the kernel periodically:

● Updates the time elapsed since system startup.

● Updates the time and date.

● Determines, for every CPU, how long the current process has been running, and preempts it if it has exceeded the time allocated to it. The allocation of time slots (also called quanta) is discussed in Chapter 11.

● Updates resource usage statistics.

● Checks whether the interval of time associated with each software timer (see Section 6.6 later in this chapter) has elapsed.

Linux's timekeeping architecture is the set of kernel data structures and functions related to the flow of time. Actually, Intel-based multiprocessor machines have a timekeeping architecture that is slightly different from the timekeeping architecture of uniprocessor machines:

● In a uniprocessor system, all time-keeping activities are triggered by interrupts raised by the Programmable Interval Timer.

● In a multiprocessor system, all general activities (like handling of software timers) are triggered by the interrupts raised by the PIT, while CPU-specific activities (like monitoring the execution time of the currently running process) are triggered by the interrupts raised by the local APIC timers.

Unfortunately, the distinction between the two cases is somewhat blurred. For instance, some early SMP systems based on Intel 80486 processors didn't have local APICs. Even nowadays, there are SMP motherboards so buggy that local timer interrupts are not usable at all. In these cases, the SMP kernel must resort to the UP timekeeping architecture. On the other hand, recent uniprocessor systems have a local APIC and an I/O APIC, so the kernel may use the SMP timekeeping architecture. Another significant case holds when an SMP-enabled kernel is running on a uniprocessor machine. However, to simplify our description, we won't discuss these hybrid cases and will stick to the two "pure" timekeeping architectures.

Linux's timekeeping architecture depends also on the availability of the Time Stamp Counter (TSC). The kernel uses two basic timekeeping functions: one to keep the current time up to date and another to count the number of microseconds that have elapsed within the current second. There are two different ways to get the last value. One method is more precise and is available if the CPU has a Time Stamp Counter; a less-precise method is used in the opposite case (see Section 6.7.1 later in this chapter).

6.2.1 Timekeeping Architecture in Uniprocessor Systems

In a uniprocessor system, all time-related activities are triggered by the interrupts raised by the Programmable Interval Timer on IRQ line 0. As usual, in Linux, some of these activities are executed as soon as possible after the interrupt is raised (in the "top half" of the interrupt handler), while the remaining activities are delayed (in the "bottom half" of the interrupt handler).

6.2.1.1 PIT's interrupt service routine

The time_init() function sets up the interrupt gate corresponding to IRQ 0 during kernel setup. Once this is done, the handler field of IRQ 0's irqaction descriptor contains the address of the timer_interrupt() function. This function starts running with the interrupts disabled, since the status field of IRQ 0's main descriptor has the SA_INTERRUPT flag set. It performs the following steps:

1. If the CPU has a TSC register, it performs the following substeps:

   a. Executes an rdtsc assembly language instruction to store the 32 least significant bits of the TSC register in the last_tsc_low variable.

   b. Reads the state of the 8254 chip device internal oscillator and computes the delay between the timer interrupt occurrence and the execution of the interrupt service routine. [3]

   c. Stores that delay (in microseconds) in the delay_at_last_interrupt variable; as we shall see in Section 6.7.1, this variable is used to provide the correct time to user processes.

2. Invokes do_timer_interrupt().

[3] The 8254 oscillator drives a counter that is continuously decremented. When the counter becomes 0, the chip raises an IRQ 0. Thus, reading the counter indicates how much time has elapsed since the interrupt occurred.

do_timer_interrupt(), which may be considered the PIT's interrupt service routine common to all 80x86 models, essentially executes the following operations:

1. It invokes the do_timer() function, which is fully explained shortly.

2. If the timer interrupt occurred in Kernel Mode, it invokes the x86_do_profile() function (see Section 6.5.3 later in this chapter).

3. If an adjtimex() system call has been issued, it invokes the set_rtc_mmss() function once every 660 seconds (every 11 minutes) to adjust the Real Time Clock. This feature helps systems on a network synchronize their clocks (see Section 6.7.2 later in this chapter).

The do_timer() function, which runs with the interrupts disabled, must be executed as quickly as possible. For this reason, it simply updates one fundamental value (the time elapsed from system startup) and checks whether the running process has exhausted its time quantum, while delegating all remaining activities to the TIMER_BH bottom half.

The function is equivalent to:

    void do_timer(struct pt_regs * regs)
    {
        jiffies++;
        update_process_times(user_mode(regs)); /* UP only */
        mark_bh(TIMER_BH);
        if (TQ_ACTIVE(tq_timer))
            mark_bh(TQUEUE_BH);
    }

The jiffies global variable stores the number of elapsed ticks since the system was started. It is set to 0 during kernel initialization and incremented by 1 when a timer interrupt occurs, that is, on every tick. Since jiffies is a 32-bit unsigned integer, it returns to 0 about 497 days after the system has been booted. However, the kernel is smart enough to handle the overflow without getting confused.

The update_process_times() function essentially checks how long the current process has been running; it is described in Section 6.3 later in this chapter.

Finally, do_timer() activates the TIMER_BH bottom half; if the tq_timer task queue is not empty (see Section 4.7), the function also activates the TQUEUE_BH bottom half.
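The 497-day figure can be checked with a little arithmetic. The sketch below assumes the 10 ms tick used throughout this chapter (100 ticks per second) and a 32-bit counter; the helper names jiffies_wrap_days and next_jiffy are invented for the illustration and are not kernel functions.

```c
#include <stdint.h>

/* Assumed tick rate: a 10 ms tick means 100 ticks per second. */
#define HZ 100u

/* Whole days until a 32-bit tick counter that starts at 0 wraps back to 0. */
static unsigned int jiffies_wrap_days(void)
{
    uint64_t ticks_to_wrap = (uint64_t)UINT32_MAX + 1;  /* 2^32 ticks */
    uint64_t seconds = ticks_to_wrap / HZ;
    return (unsigned int)(seconds / 86400u);            /* 86400 s per day */
}

/* Unsigned arithmetic wraps silently: the tick after the maximum is 0. */
static uint32_t next_jiffy(uint32_t jiffies)
{
    return (uint32_t)(jiffies + 1u);
}
```

Under these assumptions jiffies_wrap_days() evaluates to 497, and next_jiffy() applied to the largest 32-bit value yields 0, which is the overflow the kernel has to tolerate.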

6.2.1.2 The TIMER_BH bottom half

Each invocation of the "top half" PIT's timer interrupt handler marks the TIMER_BH bottom half as active. As soon as the kernel leaves interrupt mode, the timer_bh() function, which is associated with TIMER_BH, starts:

    void timer_bh(void)
    {
        update_times();
        run_timer_list();
    }

The update_times() function updates the system date and time and computes the current system load; these activities are discussed later in Section 6.4 and Section 6.5. The run_timer_list() function takes care of software timers handling; it is discussed in Section 6.6 later in this chapter.

6.2.2 Timekeeping Architecture in Multiprocessor Systems

In multiprocessor systems, timer interrupts raised by the Programmable Interval Timer still play an important role. Indeed, the corresponding interrupt handler takes care of activities not related to a specific CPU, such as the handling of software timers and keeping the system time up to date. As in the uniprocessor case, the most urgent activities are performed by the "top half" of the interrupt handler (see Section 6.2.1.1 earlier in this chapter), while the remaining activities are delayed until the execution of the TIMER_BH bottom half (see Section 6.2.1.2 earlier in this chapter).

However, the SMP version of the PIT's interrupt service routine differs from the UP version in a few points:

● The timer_interrupt() function acquires the xtime_lock read/write spin lock for writing. Although local interrupts are disabled, the kernel must protect the xtime, last_tsc_low, and delay_at_last_interrupt global variables from concurrent read and write accesses performed by other CPUs (see Section 6.4 later in this chapter).

● The do_timer_interrupt() function does not invoke the x86_do_profile() function because this function performs actions related to a specific CPU.

● The do_timer() function does not invoke update_process_times() because this function also performs actions related to a specific CPU.

There are two timekeeping activities related to every specific CPU in the system:

● Monitoring how much time the current process has been running on the CPU

● Updating the resource usage statistics of the CPU

To simplify the overall timekeeping architecture, in Linux 2.4, every CPU takes care of these activities in the handler of the local timer interrupt raised by the APIC device embedded in the CPU. In this way, the number of accessed spin locks is minimized, since every CPU tends to access only its own "private" data structures.

6.2.2.1 Initialization of the timekeeping architecture

During kernel initialization, each APIC has to be told how often to generate a local timer interrupt. The setup_APIC_clocks() function programs the local APICs of all CPUs to generate interrupts as follows:

    void setup_APIC_clocks(void)
    {
        __cli();
        calibration_result = calibrate_APIC_clock();
        setup_APIC_timer((void *)calibration_result);
        __sti();
        smp_call_function(setup_APIC_timer, (void *)calibration_result, 1, 1);
    }

The calibrate_APIC_clock() function computes how many local timer interrupts are generated by the local APIC of the booting CPU during a tick (10 ms). This exact value is then used to program the local APICs so that they generate one local timer interrupt every tick. This is done by the setup_APIC_timer() function, which is invoked directly on the booting CPU, and through the CALL_FUNCTION_VECTOR Interprocessor Interrupts (IPI) on the other CPUs (see Section 4.6.2).

All local APIC timers are synchronized because they are based on the common bus clock signal. This means that the value computed by calibrate_APIC_clock() for the booting CPU is also good for the other CPUs in the system. However, we don't really want to have all local timer interrupts generated at exactly the same time because this could induce a substantial performance penalty due to waits on spin locks. For the same reason, a local timer interrupt handler should not run on a CPU when a PIT's timer interrupt handler is being executed on another CPU.

Therefore, the setup_APIC_timer() function spreads the local timer interrupts inside the tick interval. Figure 6-1 shows an example. In a multiprocessor system with four CPUs, the beginning of the tick is marked by the PIT's timer interrupt. Two milliseconds after the PIT's timer interrupt, the local APIC of CPU 0 raises its local timer interrupt; two milliseconds later, it is the turn of the local APIC of CPU 1, and so on. Two milliseconds after the local timer interrupt of CPU 3, the PIT raises another timer interrupt on the IRQ 0 line and starts a new tick.

Figure 6-1. Spreading local timer interrupts inside a tick
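The spacing in this example is what you get by dividing the tick into one slice per CPU plus one slice for the PIT. The sketch below reproduces only that arithmetic, assuming a 10 ms tick; local_timer_offset_usec is a hypothetical helper and is not the code of setup_APIC_timer().

```c
/* Tick length in microseconds (10 ms, as in the text). */
#define TICK_USEC 10000u

/*
 * Offset, in microseconds after the PIT interrupt that opens the tick,
 * at which the local timer of CPU `cpu` fires when `ncpus` CPUs share
 * the tick. The tick is cut into ncpus + 1 equal slices; the first
 * slice boundary (offset 0) belongs to the PIT itself.
 */
static unsigned int local_timer_offset_usec(unsigned int cpu, unsigned int ncpus)
{
    unsigned int slice = TICK_USEC / (ncpus + 1u);
    return slice * (cpu + 1u);
}
```

With four CPUs the slice is 2,000 microseconds, so CPU 0 fires 2 ms into the tick and CPU 3 fires 8 ms into it, leaving 2 ms before the next PIT interrupt.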

setup_APIC_timer() programs the local APIC so that it raises timer interrupts that have vector LOCAL_TIMER_VECTOR (usually, 0xef); moreover, the init_IRQ() function associates LOCAL_TIMER_VECTOR to the low-level interrupt handler apic_timer_interrupt().

6.2.2.2 The local timer interrupt handler

The apic_timer_interrupt() assembly language function is equivalent to the following code:

    apic_timer_interrupt:
        pushl $LOCAL_TIMER_VECTOR-256
        SAVE_ALL
        movl %esp,%eax
        pushl %eax
        call smp_apic_timer_interrupt
        addl $4,%esp
        jmp ret_from_intr

As you can see, the low-level handler is very similar to the other low-level interrupt handlers already described in Chapter 4. The high-level interrupt handler called smp_apic_timer_interrupt() executes the following steps:

1. Gets the CPU logical number (say n)

2. Increments the nth entry of the apic_timer_irqs array by 1 (see Section 6.5.4 later in this chapter)

3. Acknowledges the interrupt on the local APIC

4. Calls the irq_enter() function to increment the nth entry of the local_irq_count array and to honor the global_irq_lock spin lock (see Chapter 5)

5. Invokes the smp_local_timer_interrupt() function

6. Calls the irq_exit() function to decrement the nth entry of the local_irq_count array

7. Invokes do_softirq() if some softirqs are pending (see Section 4.7.1)

The smp_local_timer_interrupt() function executes the per-CPU timekeeping activities. Actually, it performs the following steps:

1. Invokes the x86_do_profile() function if the timer interrupt occurred in Kernel Mode (see Section 6.5.3 later in this chapter)

2. Invokes the update_process_times() function to check how long the current process has been running (see Section 6.3 later in this chapter) [4]

[4] The system administrator can change the sample frequency of the kernel code profiler. To do this, the kernel changes the frequency at which local timer interrupts are generated. However, the smp_local_timer_interrupt() function keeps invoking the update_process_times() function exactly once every tick. Unfortunately, changing the frequency of a local timer interrupt destroys the elegant spreading of the local timer interrupts inside a tick interval.


6.3 CPU's Time Sharing

Timer interrupts are essential for time-sharing the CPU among runnable processes (that is, those in the TASK_RUNNING state). As we shall see in Chapter 11, each process is usually allowed a quantum of time of limited duration: if the process is not terminated when its quantum expires, the schedule() function selects the new process to run.

The counter field of the process descriptor specifies how many ticks of CPU time are left to the process. The quantum is always a multiple of a tick, that is, a multiple of about 10 ms. The value of counter is updated at every tick by update_process_times(), which is invoked by either the PIT's timer interrupt handler on uniprocessor systems or the local timer interrupt handler in multiprocessor systems. The code is equivalent to the following:

    if (current->pid) {
        --current->counter;
        if (current->counter <= 0) {
            current->counter = 0;
            current->need_resched = 1;
        }
    }

The snippet of code starts by making sure the kernel is not handling a process with PID 0, the swapper process associated with the executing CPU. It must not be time-shared because it is the process that runs on the CPU when no other TASK_RUNNING processes exist (see Section 3.2.2). When counter drops to 0, the need_resched field of the process descriptor is set to 1. In this case, the schedule() function is invoked before resuming User Mode execution, and other TASK_RUNNING processes will have a chance to resume execution on the CPU.
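To watch the quantum run out, the per-tick accounting can be exercised outside the kernel. The following is a minimal sketch: struct task is a hypothetical stand-in holding only the three process descriptor fields named in the text, and account_tick is an invented name for the per-tick step.

```c
/* Hypothetical stand-in for the few process descriptor fields used here. */
struct task {
    int pid;           /* 0 identifies the swapper (idle) process */
    int counter;       /* ticks of CPU time left in the quantum */
    int need_resched;  /* set to 1 to request a call to schedule() */
};

/* Per-tick accounting for the process currently on the CPU. */
static void account_tick(struct task *current)
{
    if (current->pid) {                    /* the swapper is never time-shared */
        --current->counter;
        if (current->counter <= 0) {
            current->counter = 0;          /* never leave a negative quantum */
            current->need_resched = 1;     /* ask for a reschedule */
        }
    }
}
```

A task that starts with counter equal to 2 is left alone after the first tick and flagged for rescheduling after the second; a task with PID 0 is never touched.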


6.4 Updating the Time and Date

User programs get the current time and date from the xtime variable of type struct timeval. The kernel also occasionally refers to it, for instance, when updating inode timestamps (see Section 1.5.4). In particular, xtime.tv_sec stores the number of seconds that have elapsed since midnight of January 1, 1970 (UTC), while xtime.tv_usec stores the number of microseconds that have elapsed within the last second (its value ranges between 0 and 999999).

During kernel initialization, the time_init() function is invoked to set up the time and date. It reads them from the Real Time Clock by invoking the get_cmos_time() function, then it initializes xtime. Once this has been done, the kernel does not need the RTC anymore; it relies instead on the TIMER_BH bottom half, which is usually activated once every tick.

The update_times() function is equivalent to the following:

    void update_times(void)
    {
        unsigned long ticks;

        write_lock_irq(&xtime_lock);
        ticks = jiffies - wall_jiffies;
        if (ticks) {
            wall_jiffies += ticks;
            update_wall_time(ticks);
        }
        write_unlock_irq(&xtime_lock);
        calc_load(ticks);
    }

On a uniprocessor system, the write_lock_irq() and write_unlock_irq() functions simply disable and enable the interrupts on the executing CPU; on multiprocessor systems, they also acquire and release the xtime_lock spin lock, which protects against concurrent accesses to the xtime variable.

The wall_jiffies variable stores the time of the last update of the xtime variable. Observe that the value of wall_jiffies can be smaller than jiffies-1, since the execution of the bottom half can be delayed; in other words, the kernel does not necessarily update the xtime variable at every tick. However, no tick is ever permanently lost, and in the long run, xtime stores the correct system time.

The update_wall_time() function invokes the update_wall_time_one_tick() function ticks consecutive times; each invocation adds 10,000 to the xtime.tv_usec field. [5] If xtime.tv_usec becomes greater than 999,999, the update_wall_time() function also updates the tv_sec field of xtime.

[5] In fact, the function is much more complex since it might tune the value 10,000 slightly. This may be necessary if an adjtimex() system call has been issued (see Section 6.7.2 later in this chapter).
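The catch-up logic and the carry from microseconds into seconds can be sketched as follows. This is a simplified illustration, not the kernel's code: struct wall_time stands in for struct timeval, advance_wall_time and catch_up_wall_time are invented names, locking is omitted, and the adjtimex() tuning of the per-tick increment is ignored.

```c
/* Hypothetical stand-in for struct timeval; the field names follow the text. */
struct wall_time {
    long tv_sec;
    long tv_usec;
};

#define USEC_PER_TICK 10000L     /* one 10 ms tick; adjtimex() tuning ignored */
#define USEC_PER_SEC  1000000L

static unsigned long jiffies;        /* ticks counted by the timer interrupt */
static unsigned long wall_jiffies;   /* ticks already folded into the clock */

/* Add `ticks` ticks to the wall clock, carrying microseconds into seconds. */
static void advance_wall_time(struct wall_time *xtime, unsigned long ticks)
{
    while (ticks--) {
        xtime->tv_usec += USEC_PER_TICK;
        if (xtime->tv_usec >= USEC_PER_SEC) {
            xtime->tv_usec -= USEC_PER_SEC;
            xtime->tv_sec++;
        }
    }
}

/* Bottom-half step: account for every tick seen since the last update. */
static void catch_up_wall_time(struct wall_time *xtime)
{
    unsigned long ticks = jiffies - wall_jiffies;

    if (ticks) {
        wall_jiffies += ticks;
        advance_wall_time(xtime, ticks);
    }
}
```

If the bottom half is delayed for three ticks, a single call to catch_up_wall_time() applies all three at once, which is why a late bottom half does not make the clock drift.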


6.5 Updating System Statistics

The kernel, among its other time-related duties, must periodically collect various data used for:

● Checking the CPU resource limit of the running processes

● Computing the average system load

● Profiling the kernel code

6.5.1 Checking the Current Process CPU Resource Limit

The update_process_times() function (invoked by either the PIT's timer interrupt handler on uniprocessor systems or the local timer interrupt handler in multiprocessor systems) updates some kernel statistics stored in the kstat variable of type kernel_stat; it then invokes update_one_process() to update some fields storing statistics that can be exported to user programs through the times() system call. In particular, a distinction is made between CPU time spent in User Mode and in Kernel Mode. The function performs the following actions:

1. Updates the per_cpu_utime field of current's process descriptor, which stores the number of ticks during which the process has been running in User Mode.

2. Updates the per_cpu_stime field of current's process descriptor, which stores the number of ticks during which the process has been running in Kernel Mode.

3. Invokes do_process_times(), which checks whether the total CPU time limit has been reached; if so, it sends SIGXCPU and SIGKILL signals to current. Section 3.2.5 describes how the limit is controlled by the rlim[RLIMIT_CPU].rlim_cur field of each process descriptor.

Two additional fields called times.tms_cutime and times.tms_cstime are provided in the process descriptor to count the number of CPU ticks spent by the process children in User Mode and Kernel Mode, respectively. For reasons of efficiency, these fields are not updated by update_one_process(), but rather when the parent process queries the state of one of its children (see Section 3.5).

6.5.2 Keeping Track of System Load

Any Unix kernel keeps track of how much CPU activity is being carried on by the system. These statistics are used by various administration utilities such as top. A user who enters the uptime command sees the statistics as the "load average" relative to the last minute, the last 5 minutes, and the last 15 minutes. On a uniprocessor system, a value of 0 means that there are no active processes (besides the swapper process 0) to run, while a value of 1 means that the CPU is 100 percent busy with a single process, and values greater than 1 mean that the CPU is shared among several active processes. [6]

[6] Linux includes in the load average all processes that are in the TASK_RUNNING and TASK_UNINTERRUPTIBLE states. However, in normal conditions, there are few TASK_UNINTERRUPTIBLE processes, so a high load usually means that the CPU is busy.

The system load data is collected by the calc_load() function, which is invoked by update_times(). This activity is therefore performed in the TIMER_BH bottom half. calc_load() counts the number of processes in the TASK_RUNNING or TASK_UNINTERRUPTIBLE state and uses this number to update the CPU usage statistics.

6.5.3 Profiling the Kernel Code

Linux includes a minimalist code profiler used by Linux developers to discover where the kernel spends its time in Kernel Mode. The profiler identifies the hot spots of the kernel, that is, the most frequently executed fragments of kernel code. Identifying the kernel hot spots is very important because they may point out kernel functions that should be further optimized.

The profiler is based on a very simple Monte Carlo algorithm: at every timer interrupt occurrence, the kernel determines whether the interrupt occurred in Kernel Mode; if so, the kernel fetches the value of the eip register before the interruption from the stack and uses it to discover what the kernel was doing before the interrupt. In the long run, the samples accumulate on the hot spots.

The x86_do_profile() function collects the data for the code profiler. It is invoked either by the do_timer_interrupt() function in uniprocessor systems (by the PIT's timer interrupt handler) or by the smp_local_timer_interrupt() function in multiprocessor systems (by the local timer interrupt handler).

To enable the code profiler, the Linux kernel must be booted by passing as a parameter the string profile=N, where 2^N denotes the size of the code fragments to be profiled. The collected data can be read from the /proc/profile file. The counters are reset by writing in the same file; in multiprocessor systems, writing into the file can also change the sample frequency (see Section 6.2.2 earlier in this chapter). However, kernel developers do not usually access /proc/profile directly; instead, they use the readprofile system command.
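The sampling step amounts to turning an instruction address into the index of a code fragment and bumping a counter. The sketch below shows one way to do that with 2^N-byte fragments; profile_slot, profile_hit, and their parameters are hypothetical names chosen for the illustration, and clamping out-of-range samples into the last slot is a design choice of this sketch.

```c
#include <stddef.h>

/*
 * Map a sampled instruction address to a histogram slot.
 *   eip        sampled instruction pointer
 *   text_start address where the profiled code begins (assumed parameter)
 *   shift      the N of profile=N, so each slot covers 2^N bytes
 *   nslots     number of slots in the histogram; must be greater than 0
 */
static size_t profile_slot(unsigned long eip, unsigned long text_start,
                           unsigned int shift, size_t nslots)
{
    unsigned long index = (eip - text_start) >> shift;

    /* Samples outside the profiled range are clamped into the last slot. */
    return index < nslots ? (size_t)index : nslots - 1;
}

/* Record one sample: increment the counter of the fragment that was running. */
static void profile_hit(unsigned int *hist, size_t nslots, unsigned long eip,
                        unsigned long text_start, unsigned int shift)
{
    hist[profile_slot(eip, text_start, shift, nslots)]++;
}
```

With shift equal to 2, every slot covers 4 bytes, so an address 37 bytes past text_start lands in slot 9. Slots that collect many hits over time mark the hot spots.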

6.5.4 Checking the NMI Watchdogs

In multiprocessor systems, Linux offers yet another feature to kernel developers: a watchdog system, which might be quite useful to detect kernel bugs that cause a system freeze. To activate such a watchdog, the kernel must be booted with the nmi_watchdog parameter.

The watchdog is based on a clever hardware feature of multiprocessor motherboards: they can broadcast the PIT's timer interrupt as NMI interrupts to all CPUs. Since NMI interrupts are not masked by the cli assembly language instruction, the watchdog can detect deadlocks even when interrupts are disabled.

As a consequence, once every tick, all CPUs, regardless of what they are doing, start executing the NMI interrupt handler; in turn, the handler invokes do_nmi(). This function gets the logical number n of the CPU, and then checks the nth entry of the apic_timer_irqs array. If the CPU is working properly, the value must be different from the value read at the previous NMI interrupt. When the CPU is running properly, the nth entry of the apic_timer_irqs array is incremented by the local timer interrupt handler (see Section 6.2.2.2 earlier in this chapter); if the counter is not incremented, the local timer interrupt handler has not been executed in a whole tick. Not a good thing, you know.

When the NMI interrupt handler detects a CPU freeze, it rings all the bells: it logs scary messages in the system log files, dumps the contents of the CPU registers and of the kernel stack (kernel oops), and finally kills the current process. This gives kernel developers a chance to discover what's gone wrong.
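The check described above boils down to comparing a per-CPU counter against the value seen at the previous NMI. The following sketch models just that comparison in ordinary C. The apic_timer_irqs name comes from the text; NR_CPUS, last_seen, local_timer_tick, and watchdog_check are assumptions made for the illustration, and the sketch only reports the freeze rather than logging or killing anything.

```c
#define NR_CPUS 4   /* hypothetical number of CPUs for this sketch */

static unsigned int apic_timer_irqs[NR_CPUS];  /* bumped by each local timer interrupt */
static unsigned int last_seen[NR_CPUS];        /* value observed at the previous NMI */

/* Models the local timer interrupt handler of CPU n. */
static void local_timer_tick(unsigned int n)
{
    apic_timer_irqs[n]++;
}

/*
 * Models the check made from the NMI handler of CPU n.
 * Returns 1 if the counter has not moved since the previous NMI,
 * meaning no local timer interrupt was handled in the meantime.
 */
static int watchdog_check(unsigned int n)
{
    unsigned int now = apic_timer_irqs[n];
    int stuck = (now == last_seen[n]);

    last_seen[n] = now;
    return stuck;
}
```

As long as local_timer_tick() runs between two consecutive checks, watchdog_check() returns 0; two checks in a row with no tick in between make the second one return 1.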


6.6 Software Timers

A timer is a software facility that allows functions to be invoked at some future moment, after a given time interval has elapsed; a time-out denotes a moment at which the time interval associated with a timer has elapsed.

Timers are widely used both by the kernel and by processes. Most device drivers use timers to detect anomalous conditions: floppy disk drivers, for instance, use timers to switch off the device motor after the floppy has not been accessed for a while, and parallel printer drivers use them to detect erroneous printer conditions.

Timers are also used quite often by programmers to force the execution of specific functions at some future time (see Section 6.7.3 later in this chapter).

Implementing a timer is relatively easy. Each timer contains a field that indicates how far in the future the timer should expire. This field is initially calculated by adding the right number of ticks to the current value of jiffies. The field does not change. Every time the kernel checks a timer, it compares the expiration field to the value of jiffies at the current moment, and the timer expires when jiffies is greater than or equal to the stored value. This comparison is made via the time_after, time_before, time_after_eq, and time_before_eq macros, which take care of possible overflows of jiffies.

Linux considers two types of timers called dynamic timers and interval timers. The first type is used by the kernel, while interval timers may be created by processes in User Mode. [7]

[7] Earlier versions of Linux used a third type of kernel timers: the so-called static timers. Static timers are very rudimentary because they cannot be dynamically allocated or destroyed, and at most there could be 32 of them. Static timers were replaced by dynamic timers, and new kernels (starting from Version 2.4) no longer support them.
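The overflow-safe comparison performed by macros such as time_after and time_after_eq relies on looking at the sign of the difference between two tick counts rather than comparing them directly. The functions below are a portable sketch of that idea, not the kernel's macros; tick_after and tick_after_eq are invented names, and the sketch assumes the usual two's complement conversion from unsigned long to long.

```c
/* Returns 1 if tick count a is strictly later than b, even across a wrap. */
static int tick_after(unsigned long a, unsigned long b)
{
    /* The unsigned difference wraps; its sign as a long gives the order. */
    return (long)(b - a) < 0;
}

/* Returns 1 if tick count a is later than or equal to b, even across a wrap. */
static int tick_after_eq(unsigned long a, unsigned long b)
{
    return (long)(a - b) >= 0;
}

/* A timer has expired once the current tick count has reached `expires`. */
static int timer_expired(unsigned long jiffies, unsigned long expires)
{
    return tick_after_eq(jiffies, expires);
}
```

Consider a timer armed 10 ticks ahead when the counter is 5 ticks short of its maximum: expires wraps around to a small number, so a naive jiffies >= expires test would report the timer as expired immediately. The signed-difference test correctly says it is still pending. The trick works as long as the two values are less than half the counter range apart.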

One word of caution about Linux timers: since checking for timer functions is always done by bottom halves that may be executed a long time after they have been activated, the kernel cannot ensure that timer functions will start right at their expiration times. It can only ensure that they are executed either at the proper time or after with a delay of up to a few hundreds of milliseconds. For this reason, timers are not appropriate for real-time applications in which expiration times must be strictly enforced.

6.6.1 Dynamic Timers

Dynamic timers may be dynamically created and destroyed. No limit is placed on the number of currently active dynamic timers.

A dynamic timer is stored in the following timer_list structure:

    struct timer_list {
        struct list_head list;
        unsigned long expires;
        unsigned long data;
        void (*function)(unsigned long);
    };

The function field contains the address of the function to be executed when the timer expires. The data field specifies a parameter to be passed to this timer function. Thanks to the data field, it is possible to define a single general-purpose function that handles the time-outs of several device drivers; the data field could store the device ID or other meaningful data that could be used by the function to differentiate the device.

The expires field specifies when the timer expires; the time is expressed as the number of ticks that have elapsed since the system started up. All timers that have an expires value smaller than or equal to the value of jiffies are considered to be expired or decayed.

The list field includes the links for a doubly linked circular list. There are 512 doubly linked circular lists to hold dynamic timers. Each timer is inserted into one of the lists based on the value of the expires field. The algorithm that uses this list is described later in this chapter.

To create and activate a dynamic timer, the kernel must:

1. Create a new timer_list object, for example, t. This can be done in several ways by:

   ● Defining a static global variable in the code.
   ● Defining a local variable inside a function; in this case, the object is stored on the Kernel Mode stack.
   ● Including the object in a dynamically allocated descriptor.

2. Initialize the object by invoking the init_timer(&t) function. This simply sets the links in the list field to NULL.

3. Load the function field with the address of the function to be activated when the timer decays. If required, load the data field with a parameter value to be passed to the function.

4. If the dynamic timer is not already inserted in a list, assign a proper value to the expires field. Otherwise, update the expires field by invoking the mod_timer( ) function, which also takes care of moving the object into the proper list (discussed shortly).

5. If the dynamic timer is not already inserted in a list, insert the t element in the proper list by invoking the add_timer(&t) function.

Once the timer has decayed, the kernel automatically removes the t element from its list. Sometimes, however, a process should explicitly remove a timer from its list using the del_timer( ) or del_timer_sync( ) functions. Indeed, a sleeping process may be woken up before the time-out is over; in this case, the process may choose to destroy the timer. Invoking del_timer( ) or del_timer_sync( ) on a timer already removed from a list does no harm, so removing the timer within the timer function is considered a good practice.
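Why removing an already removed timer is harmless can be seen in a user-space sketch of the list manipulation. This is an illustration under stated assumptions, not the kernel implementation: all sketch_ names are hypothetical, a single circular list stands in for the kernel's timer lists, and no locking is shown. The only behavior taken from the text is that initialization sets the links to NULL.

```c
#include <stddef.h>

/* Minimal stand-ins for the kernel types named in the text. */
struct list_head { struct list_head *next, *prev; };

struct timer_sketch {
    struct list_head list;
    unsigned long expires;
};

/* One circular list stands in for the kernel's timer lists. */
struct list_head pending_list = { &pending_list, &pending_list };

/* Like init_timer( ): NULL links mean "not inserted in any list". */
void sketch_init_timer(struct timer_sketch *t)
{
    t->list.next = t->list.prev = NULL;
}

int sketch_timer_pending(const struct timer_sketch *t)
{
    return t->list.next != NULL;
}

/* Like add_timer( ): link the timer at the tail of the list. */
void sketch_add_timer(struct timer_sketch *t)
{
    t->list.next = &pending_list;
    t->list.prev = pending_list.prev;
    pending_list.prev->next = &t->list;
    pending_list.prev = &t->list;
}

/* Like del_timer( ): unlink if inserted; otherwise do nothing.
 * Returning 1 or 0 is a convention chosen for this sketch. */
int sketch_del_timer(struct timer_sketch *t)
{
    if (!sketch_timer_pending(t))
        return 0;                      /* already detached: harmless */
    t->list.prev->next = t->list.next;
    t->list.next->prev = t->list.prev;
    t->list.next = t->list.prev = NULL;
    return 1;
}
```

Because the links are reset to NULL on removal, a second call to sketch_del_timer( ) sees a detached timer and leaves the list untouched.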

6.6.1.1 Dynamic timers and race conditions

Being asynchronously activated, dynamic timers are prone to race conditions. For instance, consider a dynamic timer whose function acts on a discardable resource (e.g., a kernel module or a file data structure). Releasing the resource without stopping the timer may lead to data corruption if the timer function got activated when the resource no longer exists. Thus, a rule of thumb is to stop the timer before releasing the resource:

    ...
    del_timer(&t);
    X_Release_Resources( );
    ...

In multiprocessor systems, however, this code is not safe because the timer function might already be running on another CPU when del_timer( ) is invoked. As a result, resources may be released while the timer function is still acting on them. To avoid this kind of race condition, the kernel offers the del_timer_sync( ) function. It removes the timer from the list, and then it checks whether the timer function is executed on another CPU; in such a case, del_timer_sync( ) waits until the timer function terminates.

Other types of race conditions exist, of course. For instance, the right way to modify the expires field of an already activated timer consists of using mod_timer( ), rather than deleting the timer and recreating it thereafter. In the latter approach, two kernel control paths that want to modify the expires field of the same timer may mix each other up badly. The implementation of the timer functions is made SMP-safe by means of the timerlist_lock spin lock: every time the kernel must access the lists of dynamic timers, it disables the interrupts and acquires this spin lock.

6.6.1.2 Dynamic timers handling

Choosing the proper data structure to implement dynamic timers is not easy. Stringing together all timers in a single list would degrade system performance, since scanning a long list of timers at every tick is costly. On the other hand, maintaining a sorted list would not be much more efficient, since the insertion and deletion operations would also be costly.

The adopted solution is based on a clever data structure that partitions the expires values into blocks of ticks and allows dynamic timers to percolate efficiently from lists with larger expires values to lists with smaller ones.

The main data structure is an array called tvecs, whose elements point to five groups of lists identified by the tv1, tv2, tv3, tv4, and tv5 structures (see Figure 6-2).

Figure 6-2. The groups of lists associated with dynamic timers

The tv1 structure is of type struct timer_vec_root, which includes an index field and a vec array of 256 list_head elements (that is, lists of dynamic timers). It contains all dynamic timers that will decay within the next 255 ticks.

The index field specifies the currently scanned list; it is initialized to 0 and incremented by 1 (modulo 256) at every tick. The list referenced by index contains all dynamic timers that expired during the current tick; the next list contains all dynamic timers that will expire in the next tick; the (index+k)th list contains all dynamic timers that will expire in exactly k ticks. When index returns to 0, it signifies that all the timers in tv1 have been scanned. In this case, the list pointed to by tv2.vec[tv2.index] is used to replenish tv1.

The tv2, tv3, and tv4 structures of type struct timer_vec contain all dynamic timers that will decay within the next 2^14 - 1, 2^20 - 1, and 2^26 - 1 ticks, respectively.

The tv5 structure is identical to the previous ones, except that the last entry of the vec array includes dynamic timers with extremely large expires fields. It never needs to be replenished from another array.

The timer_vec structure is very similar to timer_vec_root: it contains an index field and a vec array of 64 pointers to dynamic timer lists. The index field specifies the currently scanned list; it is incremented by 1 (modulo 64) every 256 × 64^(i-2) ticks, where i, ranging between 2 and 5, is the tvi group number (that is, every 2^8, 2^14, 2^20, and 2^26 ticks, consistent with the ranges given above). As in the case of tv1, when index returns to 0, the list pointed to by tvj.vec[tvj.index] is used to replenish tvi (i ranges between 2 and 4, j is equal to i+1).

Thus, the first element of tv2 holds a list of all timers that expire in the 256 ticks following the tv1 timers; the timers in this list are sufficient to replenish the whole array tv1. The second element of tv2 holds all timers that expire in the following 256 ticks, and so on. Similarly, a single entry of tv3 is sufficient to replenish the whole array tv2.
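The partition of expires values into groups can be sketched in user space. The code below is a sketch derived from the ranges given above (8 bits for tv1, 6 more bits for each further group); it is not the kernel's insertion routine. The pick_slot( ) function and the timer_slot type are hypothetical names, tick counts are modeled as 32-bit values, and the sketch assumes that tv1.index always equals the low 8 bits of timer_jiffies.

```c
#include <stdint.h>

/* Which group (1..5 for tv1..tv5) and which list inside it. */
struct timer_slot {
    int group;
    unsigned int index;
};

struct timer_slot pick_slot(uint32_t expires, uint32_t timer_jiffies)
{
    uint32_t delta = expires - timer_jiffies;   /* ticks still to wait */
    struct timer_slot s;

    if ((int32_t)delta < 0) {
        /* Already expired: use the tv1 list that is scanned next. */
        s.group = 1;
        s.index = timer_jiffies & 0xff;
    } else if (delta < (1u << 8)) {
        s.group = 1;
        s.index = expires & 0xff;
    } else if (delta < (1u << 14)) {
        s.group = 2;
        s.index = (expires >> 8) & 0x3f;
    } else if (delta < (1u << 20)) {
        s.group = 3;
        s.index = (expires >> 14) & 0x3f;
    } else if (delta < (1u << 26)) {
        s.group = 4;
        s.index = (expires >> 20) & 0x3f;
    } else {
        s.group = 5;
        s.index = (expires >> 26) & 0x3f;
    }
    return s;
}
```

The list index is taken from the bits of expires itself rather than from the remaining delay, which is what lets a whole tv2 list be redistributed over the 256 lists of tv1 when its turn comes.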

Figure 6-2 shows how these data structures are connected.

The timer_bh( ) function associated with the TIMER_BH bottom half invokes the run_timer_list( ) auxiliary function to check for decayed dynamic timers. The function relies on a variable similar to jiffies that is called timer_jiffies. This new variable is needed because a few timer interrupts might occur before the activated TIMER_BH bottom half has a chance to run; this happens typically when several interrupts of different types are issued in a short interval of time.

The value of timer_jiffies represents the expiration time of the dynamic timer list yet to be checked: if it coincides with the value of jiffies, no backlog of bottom half functions has accumulated; if it is smaller than jiffies, then bottom half functions that refer to previous ticks must be dealt with. The variable is set to 0 at system startup and is incremented only by run_timer_list( ). Its value can never be greater than jiffies.

The run_timer_list( ) function is essentially equivalent to the following C fragment:

    struct list_head *head, *curr;
    struct timer_list *timer;
    void (*fn)(unsigned long);
    unsigned long data;

    spin_lock_irq(&timerlist_lock);
    while ((long)(jiffies - timer_jiffies) >= 0) {
        if (!tv1.index) {
            int n = 1;
            do {
                cascade_timers(tvecs[n]);
            } while (tvecs[n]->index == 1 && ++n < 5);
        }
        head = &tv1.vec[tv1.index];
        for (curr = head->next; curr != head; curr = head->next) {
            timer = list_entry(curr, struct timer_list, list);
            fn = timer->function;
            data = timer->data;
            detach_timer(timer);
            timer->list.next = timer->list.prev = NULL;
            running_timer = timer;
            spin_unlock_irq(&timerlist_lock);
            fn(data);
            spin_lock_irq(&timerlist_lock);
            running_timer = NULL;
        }
        ++timer_jiffies;
        tv1.index = (tv1.index + 1) & 0xff;
    }
    spin_unlock_irq(&timerlist_lock);

The outermost while loop ends when timer_jiffies becomes greater than the value of jiffies. Since the values of jiffies and timer_jiffies usually coincide, the outermost while cycle is often executed only once. In general, the outermost loop is executed jiffies - timer_jiffies + 1 consecutive times. Moreover, if a timer interrupt occurs while run_timer_list( ) is being executed, dynamic timers that decay at this tick occurrence are also considered, since the jiffies variable is asynchronously incremented by the PIT's interrupt handler (see the earlier Section 6.2.1.1).

During a single execution of the outermost while cycle, the dynamic timer functions included in the tv1.vec[tv1.index] list are executed. Before executing a dynamic timer function, the loop invokes the detach_timer( ) function to remove the dynamic timer from the list. Once the list is emptied, the value of tv1.index is incremented (modulo 256) and the value of timer_jiffies is incremented.
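The iteration count of the outermost loop, and the reason the comparison is written as a signed difference, can be checked with a short user-space sketch. The catch_up( ) function is a hypothetical name; it keeps only the loop condition and the increment of timer_jiffies, and it models tick counters as 32-bit values that may wrap around.

```c
#include <stdint.h>

/* Counts the passes of a loop shaped like the outermost while loop:
 * one pass for every tick from *timer_jiffies up to jiffies, inclusive. */
unsigned int catch_up(uint32_t jiffies, uint32_t *timer_jiffies)
{
    unsigned int passes = 0;

    /* The signed difference stays correct when the counters wrap. */
    while ((int32_t)(jiffies - *timer_jiffies) >= 0) {
        ++passes;              /* one tv1 list would be scanned here */
        ++*timer_jiffies;
    }
    return passes;
}
```

With no backlog the loop runs once; with a backlog of three ticks it runs four times, which is the jiffies - timer_jiffies + 1 figure stated in the text.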

When tv1.index becomes equal to 0, all the lists of tv1 have been checked; in this case, it is necessary to refill the tv1 structure. This is accomplished by the cascade_timers( ) function, which transfers the dynamic timers included in tv2.vec[tv2.index] into tv1.vec, since they will necessarily decay within the next 256 ticks. If tv2.index is equal to 0, the tv2 array of lists must be refilled with the elements of tv3.vec[tv3.index].

Notice that run_timer_list( ) disables interrupts and acquires the timerlist_lock spin lock just before entering the outermost loop; interrupts are enabled and the spin lock is released right before invoking each dynamic timer function, until its termination. This ensures that the dynamic timer data structures are not corrupted by interleaved kernel control paths.

To sum up, this rather complex algorithm ensures excellent performance. To see why, assume for the sake of simplicity that the TIMER_BH bottom half is executed right after the corresponding timer interrupt occurs. Then, in 255 timer interrupt occurrences out of 256 (in 99.6% of the cases), the run_timer_list( ) function just runs the functions of the decayed timers, if any. To replenish tv1.vec periodically, it is sufficient 63 times out of 64 to partition the list pointed to by tv2.vec[tv2.index] into the 256 lists pointed to by tv1.vec. The tv2.vec array, in turn, must be replenished in 0.02 percent of the cases (that is, once every 163 seconds). Similarly, tv3 is replenished every 2 hours and 54 minutes, and tv4 is replenished every 7 days and 18 hours. tv5 doesn't need to be replenished.
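The replenishment intervals quoted above follow from the group sizes. The sketch below redoes the arithmetic; both function names are hypothetical, and the tick frequency is a parameter. A rate of 100 ticks per second is assumed in the usage note because it is the value that reproduces the figures in the text.

```c
/* Ticks between two replenishments of group tv<group>, group in 1..4:
 * tv1 every 2^8 ticks, tv2 every 2^14, tv3 every 2^20, tv4 every 2^26. */
unsigned long replenish_period_ticks(int group)
{
    return 1UL << (8 + 6 * (group - 1));
}

/* The same period in whole seconds, for a given tick frequency. */
unsigned long replenish_period_seconds(int group, unsigned long hz)
{
    return replenish_period_ticks(group) / hz;
}
```

At 100 ticks per second, replenish_period_seconds(2, 100) is 163 seconds, group 3 yields 10,485 seconds (2 hours and 54 minutes), and group 4 yields 671,088 seconds (7 days and 18 hours).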

6.6.2 An Application of Dynamic Timers

To show how the outcome of all the previous activities is actually used in the kernel, we'll show an example of the creation and use of a process time-out.

Let's assume that the kernel decides to suspend the current process for two seconds. It does this by executing the following code:

    timeout = 2 * HZ;
    set_current_state(TASK_INTERRUPTIBLE); /* or TASK_UNINTERRUPTIBLE */
    remaining = schedule_timeout(timeout);

The kernel implements process time-outs by using dynamic timers. They appear in the schedule_timeout( ) function, which essentially executes the following statements:

    struct timer_list timer;

    expire = timeout + jiffies;
    init_timer(&timer);
    timer.expires = expire;
    timer.data = (unsigned long) current;
    timer.function = process_timeout;
    add_timer(&timer);
    schedule( ); /* process suspended until timer expires */
    del_timer_sync(&timer);
    timeout = expire - jiffies;
    return (timeout < 0 ? 0 : timeout);

When schedule( ) is invoked, another process is selected for execution; when the former process resumes its execution, the function removes the dynamic timer. In the last statement, the function either returns 0, if the time-out is expired, or it returns the number of ticks left to the time-out expiration if the process was awoken for some other reason.

When the time-out expires, the kernel executes the following function:

    void process_timeout(unsigned long data)
    {
        struct task_struct * p = (struct task_struct *) data;
        wake_up_process(p);
    }

The run_timer_list( ) function invokes process_timeout( ), passing as its parameter the process descriptor pointer stored in the data field of the timer object. As a result, the suspended process is woken up.
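The value returned by schedule_timeout( ) can be illustrated on its own. The sketch below reproduces only the last two statements of the fragment shown earlier in this section, in user space; remaining_ticks( ) is a hypothetical name, and both tick values are passed in explicitly instead of being read from jiffies.

```c
/* Ticks left before the time-out expires, or 0 if it has already expired. */
long remaining_ticks(unsigned long expire, unsigned long now)
{
    long left = (long)(expire - now);   /* signed, so "too late" is negative */

    return left < 0 ? 0 : left;
}
```

A process that asked for 200 ticks at tick 1000 and is woken by a signal at tick 1050 gets 150 back; if it resumes at or after tick 1200, it gets 0.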


6.7 System Calls Related to Timing Measurements

Several system calls allow User Mode processes to read and modify the time and date and to create timers. Let's briefly review these and discuss how the kernel handles them.

6.7.1 The time( ), ftime( ), and gettimeofday( ) System Calls

Processes in User Mode can get the current time and date by means of several system calls:

time( )
    Returns the number of elapsed seconds since midnight at the start of January 1, 1970 (UTC).

ftime( )
    Returns, in a data structure of type timeb, the number of elapsed seconds since midnight of January 1, 1970 (UTC) and the number of elapsed milliseconds in the last second.

gettimeofday( )
    Returns, in a data structure named timeval, the number of elapsed seconds since midnight of January 1, 1970 (UTC) and the number of elapsed microseconds in the last second (a second data structure named timezone is not currently used).

The former system calls are superseded by gettimeofday( ), but they are still included in Linux for backward compatibility. We don't discuss them further.

The gettimeofday( ) system call is implemented by the sys_gettimeofday( ) function. To compute the current date and time of the day, this function invokes do_gettimeofday( ), which executes the following actions:

1. Disables the interrupts and acquires the xtime_lock read/write spin lock for reading.

2. Gets the number of microseconds elapsed in the last second by using the function whose address is stored in do_gettimeoffset:

       usec = do_gettimeoffset( );

   If the CPU has a Time Stamp Counter, the do_fast_gettimeoffset( ) function is executed. It reads the TSC register by using the rdtsc assembly language instruction; it then subtracts the value stored in last_tsc_low to obtain the number of CPU cycles elapsed since the last timer interrupt was handled. The function converts that number to microseconds and adds in the delay that elapsed before the activation of the timer interrupt handler (stored in the delay_at_last_interrupt variable mentioned earlier in Section 6.2.1.1).

   If the CPU does not have a TSC register, do_gettimeoffset points to the do_slow_gettimeoffset( ) function. It reads the state of the 8254 chip device internal oscillator and then computes the time length elapsed since the last timer interrupt. Using that value and the contents of jiffies, it can derive the number of microseconds elapsed in the last second.

3. Further increases the number of microseconds in order to take into account all timer interrupts whose bottom halves have not yet been executed:

       usec += (jiffies - wall_jiffies) * (1000000/HZ);

4. Copies the contents of xtime into the user-space buffer specified by the system call parameter tv, adding the microseconds computed so far to the tv_usec field:

       tv->tv_sec = xtime.tv_sec;
       tv->tv_usec = xtime.tv_usec + usec;

5. Releases the xtime_lock spin lock and reenables the interrupts.

6. Checks for an overflow in the microseconds field, adjusting both that field and the second field if necessary:

       while (tv->tv_usec >= 1000000) {
           tv->tv_usec -= 1000000;
           tv->tv_sec++;
       }
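The arithmetic of steps 3, 4, and 6 can be tried out in user space. This is a sketch under stated assumptions rather than the kernel function: compose_time( ) and tv_sketch are hypothetical names, the microsecond offset of step 2 is supplied by the caller, and no locking or copy to user space is performed.

```c
/* Stand-in for the seconds/microseconds pair used in the steps above. */
struct tv_sketch {
    long tv_sec;
    long tv_usec;
};

struct tv_sketch compose_time(struct tv_sketch xtime,
                              unsigned long offset_usec,
                              unsigned long jiffies,
                              unsigned long wall_jiffies,
                              unsigned long hz)
{
    unsigned long usec = offset_usec;
    struct tv_sketch tv;

    /* Step 3: account for ticks whose bottom halves have not run yet. */
    usec += (jiffies - wall_jiffies) * (1000000UL / hz);

    /* Step 4: start from xtime and add the microseconds computed so far. */
    tv.tv_sec = xtime.tv_sec;
    tv.tv_usec = xtime.tv_usec + (long)usec;

    /* Step 6: carry any overflow of the microseconds into the seconds. */
    while (tv.tv_usec >= 1000000) {
        tv.tv_usec -= 1000000;
        tv.tv_sec++;
    }
    return tv;
}
```

For example, with xtime at 100 s and 995,000 µs, an offset of 3,000 µs, and two unprocessed ticks at 100 ticks per second, the sum of 1,018,000 µs is normalized to 101 s and 18,000 µs.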

Processes in User Mode with root privilege may modify the current date and time by using either the obsolete stime( ) or the settimeofday( ) system call. The sys_settimeofday( ) function invokes do_settimeofday( ), which executes operations complementary to those of do_gettimeofday( ).

Notice that both system calls modify the value of xtime while leaving the RTC registers unchanged. Therefore, the new time is lost when the system shuts down, unless the user executes the clock program to change the RTC value.

6.7.2 The adjtimex( ) System Call

Although clock drift ensures that all systems eventually move away from the correct time, changing the time abruptly is both an administrative nuisance and risky behavior. Imagine, for instance, programmers trying to build a large program and depending on file timestamps to make sure that out-of-date object files are recompiled. A large change in the system's time could confuse the make program and lead to an incorrect build. Keeping the clocks tuned is also important when implementing a distributed filesystem on a network of computers. In this case, it is wise to adjust the clocks of the interconnected PCs so that the timestamp values associated with the inodes of the accessed files are coherent. Thus, systems are often configured to run a time synchronization protocol such as Network Time Protocol (NTP) on a regular basis to change the time gradually at each tick. This utility depends on the adjtimex( ) system call in Linux.

This system call is present in several Unix variants, although it should not be used in programs intended to be portable. It receives as its parameter a pointer to a timex structure, updates kernel parameters from the values in the timex fields, and returns the same structure with current kernel values. Such kernel values are used by update_wall_time_one_tick( ) to slightly adjust the number of microseconds added to xtime.tv_usec at each tick.

6.7.3 The setitimer( ) and alarm( ) System Calls

Linux allows User Mode processes to activate special timers called interval timers.[8]

[8] These software constructs have nothing in common with the Programmable Interval Timer chips described earlier in this chapter.

The timers cause Unix signals (see Chapter 10) to be sent periodically to the process. It is also possible to activate an interval timer so that it sends just one signal after a specified delay. Each interval timer is therefore characterized by:

● The frequency at which the signals must be emitted, or a null value if just one signal has to be generated
● The time remaining until the next signal is to be generated

The earlier warning about accuracy applies to these timers. They are guaranteed to execute after the requested time has elapsed, but it is impossible to predict exactly when they will be delivered.

Interval timers are activated by means of the POSIX setitimer( ) system call. The first parameter specifies which of the following policies should be adopted:

ITIMER_REAL
    The actual elapsed time; the process receives SIGALRM signals.

ITIMER_VIRTUAL
    The time spent by the process in User Mode; the process receives SIGVTALRM signals.

ITIMER_PROF
    The time spent by the process both in User and in Kernel Mode; the process receives SIGPROF signals.

To implement an interval timer for each of the preceding policies, the process descriptor includes three pairs of fields:

● it_real_incr and it_real_value
● it_virt_incr and it_virt_value
● it_prof_incr and it_prof_value

The first field of each pair stores the interval in ticks between two signals; the other field stores the current value of the timer.

The ITIMER_REAL interval timer is implemented by using dynamic timers because the kernel must send signals to the process even when it is not running on the CPU. Therefore, each process descriptor includes a dynamic timer object called real_timer. The setitimer( ) system call initializes the real_timer fields and then invokes add_timer( ) to insert the dynamic timer in the proper list. When the timer expires, the kernel executes the it_real_fn( ) timer function. In turn, the it_real_fn( ) function sends a SIGALRM signal to the process; if it_real_incr is not null, it sets the expires field again, reactivating the timer.

The ITIMER_VIRTUAL and ITIMER_PROF interval timers do not require dynamic timers, since they can be updated while the process is running. The do_it_virt( ) and do_it_prof( ) functions are invoked by update_one_process( ), which is called either by the PIT's timer interrupt handler (UP) or by the local timer interrupt handlers (SMP). Therefore, the two interval timers are updated once every tick, and if they are expired, the proper signal is sent to the current process.

The alarm( ) system call sends a SIGALRM signal to the calling process when a specified time interval has elapsed. It is very similar to setitimer( ) when invoked with the ITIMER_REAL parameter, since it uses the real_timer dynamic timer included in the process descriptor. Therefore, alarm( ) and setitimer( ) with parameter ITIMER_REAL cannot be used at the same time.
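From the User Mode side, a periodic ITIMER_REAL timer can be exercised with the standard POSIX calls setitimer( ), sigaction( ), and sigsuspend( ). The following is a minimal sketch, not code from the kernel or from this book: wait_for_alarms( ) and on_alarm( ) are hypothetical names, and the program assumes nothing else in the process uses SIGALRM, alarm( ), or ITIMER_REAL.

```c
#define _XOPEN_SOURCE 700
#include <signal.h>
#include <string.h>
#include <sys/time.h>

static volatile sig_atomic_t alarms_seen;

static void on_alarm(int signo)
{
    (void)signo;
    alarms_seen++;
}

/* Arms a periodic ITIMER_REAL timer and waits for `wanted` SIGALRM signals.
 * Returns the number of signals handled, or -1 on error or bad arguments. */
int wait_for_alarms(int wanted, long interval_usec)
{
    struct sigaction sa, ignore, old_sa;
    struct itimerval it, off;
    sigset_t block, old_mask, wait_mask;
    int seen = -1;

    if (wanted <= 0 || interval_usec <= 0)
        return -1;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGALRM, &sa, &old_sa) < 0)
        return -1;

    /* Keep SIGALRM blocked except inside sigsuspend( ), so none is missed. */
    sigemptyset(&block);
    sigaddset(&block, SIGALRM);
    sigprocmask(SIG_BLOCK, &block, &old_mask);
    wait_mask = old_mask;
    sigdelset(&wait_mask, SIGALRM);

    alarms_seen = 0;
    it.it_interval.tv_sec = interval_usec / 1000000;   /* period */
    it.it_interval.tv_usec = interval_usec % 1000000;
    it.it_value = it.it_interval;                      /* first expiration */
    if (setitimer(ITIMER_REAL, &it, NULL) == 0) {
        while (alarms_seen < wanted)
            sigsuspend(&wait_mask);                    /* one signal per call */
        seen = alarms_seen;
    }

    memset(&off, 0, sizeof(off));
    setitimer(ITIMER_REAL, &off, NULL);                /* zero value disarms */

    /* Discard a possibly pending SIGALRM, then restore the caller's state. */
    memset(&ignore, 0, sizeof(ignore));
    ignore.sa_handler = SIG_IGN;
    sigemptyset(&ignore.sa_mask);
    sigaction(SIGALRM, &ignore, NULL);
    sigprocmask(SIG_SETMASK, &old_mask, NULL);
    sigaction(SIGALRM, &old_sa, NULL);
    return seen;
}
```

A call such as wait_for_alarms(3, 10000) returns after roughly 30 milliseconds; in line with the earlier warning about accuracy, the signals are never delivered early, but the exact delivery time is not guaranteed.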


Chapter 7. Memory Management

We saw in Chapter 2 how Linux takes advantage of 80x86's segmentation and paging circuits to translate logical addresses into physical ones. We also mentioned that some portion of RAM is permanently assigned to the kernel and used to store both the kernel code and the static kernel data structures.

The remaining part of the RAM is called dynamic memory. It is a valuable resource, needed not only by the processes but also by the kernel itself. In fact, the performance of the entire system depends on how efficiently dynamic memory is managed. Therefore, all current multitasking operating systems try to optimize the use of dynamic memory, assigning it only when it is needed and freeing it as soon as possible.

This chapter, which consists of three main sections, describes how the kernel allocates dynamic memory for its own use. Section 7.1 and Section 7.2 illustrate two different techniques for handling physically contiguous memory areas, while Section 7.3 illustrates a third technique that handles noncontiguous memory areas.


7.1 Page Frame Management

We saw in Section 2.4 how the Intel Pentium processor can use two different page frame sizes: 4 KB and 4 MB (or 2 MB if PAE is enabled; see Section 2.4.6). Linux adopts the smaller 4 KB page frame size as the standard memory allocation unit. This makes things simpler for two reasons:

● The Page Fault exceptions issued by the paging circuitry are easily interpreted. Either the page requested exists but the process is not allowed to address it, or the page does not exist. In the second case, the memory allocator must find a free 4 KB page frame and assign it to the process.
● The 4 KB size is a multiple of most disk block sizes, so transfers of data between main memory and disks are more efficient. Yet this smaller size is much more manageable than the 4 MB size.

7.1.1 Page Descriptors

The kernel must keep track of the current status of each page frame. For instance, it must be able to distinguish the page frames that are used to contain pages that belong to processes from those that contain kernel code or kernel data structures. Similarly, it must be able to determine whether a page frame in dynamic memory is free. A page frame in dynamic memory is free if it does not contain any useful data. It is not free when the page frame contains data of a User Mode process, data of a software cache, dynamically allocated kernel data structures, buffered data of a device driver, code of a kernel module, and so on.

State information of a page frame is kept in a page descriptor of type struct page, whose fields are shown in Table 7-1. All page descriptors are stored in the mem_map array. Since each descriptor is less than 64 bytes long, mem_map requires about four page frames for each megabyte of RAM.
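The "four page frames for each megabyte" estimate can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: mem_map_page_frames( ) is a hypothetical helper, and it uses 64 bytes as an upper bound for the descriptor size, as the text does.

```c
/* Page frames needed to hold one descriptor per page frame of RAM. */
unsigned long mem_map_page_frames(unsigned long ram_bytes,
                                  unsigned long page_size,
                                  unsigned long descriptor_size)
{
    unsigned long frames = ram_bytes / page_size;      /* descriptors needed */
    unsigned long bytes = frames * descriptor_size;    /* size of the array */

    return (bytes + page_size - 1) / page_size;        /* round up to frames */
}
```

One megabyte of RAM holds 256 page frames of 4 KB; 256 descriptors of 64 bytes occupy 16 KB, that is, four page frames. With 128 MB of RAM the same computation gives 512 page frames (2 MB) for mem_map.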

Table 7-1. The fields of the page descriptor

Type | Name | Description
struct list_head | list | Contains pointers to next and previous items in a doubly linked list of page descriptors
struct address_space * | mapping | Used when the page is inserted into the page cache (see Section 14.1)
unsigned long | index | Either the position of the data stored in the page within the page's disk image (see Chapter 14) or a swapped-out page identifier (see Chapter 16)
struct page * | next_hash | Contains pointer to next item in a doubly linked circular list of the page cache hash table
atomic_t | count | Page's reference counter
unsigned long | flags | Array of flags (see Table 7-2)
struct list_head | lru | Contains pointers to the least recently used doubly linked list of pages
wait_queue_head_t | wait | Page's wait queue
struct page * * | pprev_hash | Contains pointer to previous item in a doubly linked circular list of the page cache hash table
struct buffer_head * | buffers | Used when the page stores buffers (see Section 13.4.8.2)
void * | virtual | Linear address of the page frame in the fourth gigabyte (see Section 7.1.5 later in this chapter)
struct zone_struct * | zone | The zone to which the page frame belongs (see Section 7.1.2)

You don't have to fully understand the role of all fields in the page descriptor right now. In the following chapters, we often come back to the fields of the page descriptor. Moreover, several fields have different meanings, according to whether the page frame is free and what kernel component is using the page frame. Let's describe in greater detail two of the fields:

count
A usage reference counter for the page. If it is set to 0, the corresponding page frame is free and can be assigned to any process or to the kernel itself. If it is set to a value greater than 0, the page frame is assigned to one or more processes or is used to store some kernel data structures.

flags
Includes up to 32 flags (see Table 7-2) that describe the status of the page frame. For each PG_xyz flag, the kernel defines some macros that manipulate its value. Usually, the PageXyz macro returns the value of the flag, while the SetPageXyz and ClearPageXyz macros set and clear the corresponding bit, respectively.
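The naming convention for the flag macros can be illustrated with a small C sketch. It is a simplified model, not the kernel's definition: the structure page_sketch is a stand-in that holds only the flags word, the bit positions are illustrative values, and plain bit operations are used where a real kernel would be expected to use atomic ones.

```c
#include <assert.h>

/* Simplified stand-in for the page descriptor: only the flags word. */
struct page_sketch {
    unsigned long flags;
};

/* Bit positions chosen for illustration; not taken from kernel headers. */
enum { PG_locked = 0, PG_error = 1, PG_reserved = 2 };

/* PageXyz returns the flag; SetPageXyz and ClearPageXyz change it. */
#define PageReserved(p)      (((p)->flags >> PG_reserved) & 1UL)
#define SetPageReserved(p)   ((void)((p)->flags |= 1UL << PG_reserved))
#define ClearPageReserved(p) ((void)((p)->flags &= ~(1UL << PG_reserved)))

/* Usage example: mark a descriptor reserved, then release the mark. */
static void page_flags_example(void)
{
    struct page_sketch p = { 0 };

    SetPageReserved(&p);
    assert(PageReserved(&p));
    ClearPageReserved(&p);
    assert(!PageReserved(&p));
}
```

Keeping all flags in a single word is what limits their number to the width of an unsigned long, that is, 32 on the 80x86 architecture.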

Table 7-2. Flags describing the status of a page frame

Flag name | Meaning
PG_locked | The page is involved in a disk I/O operation.
PG_error | An I/O error occurred while transferring the page.
PG_referenced | The page has been recently accessed for a disk I/O operation.
PG_uptodate | The flag is set after completing a read operation, unless a disk I/O error happened.
PG_dirty | The page has been modified (see Section 16.5.1).
PG_lru | The page is in the active or inactive page list (see Section 16.7.2).
PG_active | The page is in the active page list (see Section 16.7.2).
PG_slab | The page frame is included in a slab (see Section 7.2 later in this chapter).
PG_skip | Not used.
PG_highmem | The page frame belongs to the ZONE_HIGHMEM zone (see Section 7.1.2).
PG_checked | The flag used by the Ext2 filesystem (see Chapter 17).
PG_arch_1 | Not used on the 80x86 architecture.
PG_reserved | The page frame is reserved to kernel code or is unusable.
PG_launder | The page is involved in an I/O operation triggered by shrink_cache( ) (see Section 16.7.5).

7.1.2 Memory Zones

In an ideal computer architecture, a page frame is a memory storage unit that can be used for anything: storing kernel and user data, buffering disk data, and so on. Any kind of page of data can be stored in any page frame, without limitations. However, real computer architectures have hardware constraints that may limit the way page frames can be used. In particular, the Linux kernel must deal with two hardware constraints of the 80x86 architecture:

● The Direct Memory Access (DMA) processors for ISA buses have a strong limitation: they are able to address only the first 16 MB of RAM.

● In modern 32-bit computers with lots of RAM, the CPU cannot directly access all physical memory because the linear address space is too small.

To cope with these two limitations, Linux partitions the physical memory in three zones:

ZONE_DMA
Contains pages of memory below 16 MB

ZONE_NORMAL
Contains pages of memory at and above 16 MB and below 896 MB

ZONE_HIGHMEM
Contains pages of memory at and above 896 MB

The ZONE_DMA zone includes memory pages that can be used by old ISA-based devices by means of the DMA. (Section 13.1.4 gives further details on DMA.)

The ZONE_DMA and ZONE_NORMAL zones include the "normal" pages of memory that can be directly accessed by the kernel through the linear mapping in the fourth gigabyte of the linear address space (see Section 2.5.5). Conversely, the ZONE_HIGHMEM zone includes pages of memory that cannot be directly accessed by the kernel through the linear mapping in the fourth gigabyte of linear address space (see Section 7.1.6 later in this chapter). The ZONE_HIGHMEM zone is not used on 64-bit architectures.
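The zone boundaries just listed can be expressed as a short classification routine. The following C sketch is only an illustration of the 16 MB and 896 MB thresholds on the 80x86 architecture; the function zone_of( ), the enumeration zone_kind, and the MB( ) macro are names assumed here, and the kernel itself does not classify pages this way at run time (it stores a zone pointer in each page descriptor).

```c
#include <assert.h>

enum zone_kind { ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM };

/* Megabytes to bytes; 64-bit arithmetic so addresses above 4 GB fit. */
#define MB(x) ((unsigned long long)(x) << 20)

/* Hypothetical helper: zone that contains the given physical address. */
static enum zone_kind zone_of(unsigned long long paddr)
{
    if (paddr < MB(16))
        return ZONE_DMA;       /* reachable by ISA DMA          */
    if (paddr < MB(896))
        return ZONE_NORMAL;    /* linearly mapped by the kernel */
    return ZONE_HIGHMEM;       /* not linearly mapped           */
}

/* Usage example: 896 MB is the first address of ZONE_HIGHMEM. */
static void zone_of_example(void)
{
    assert(zone_of(MB(896)) == ZONE_HIGHMEM);
}
```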

Each memory zone has its own descriptor of type struct zone_struct (or equivalently, zone_t). Its fields are shown in Table 7-3.

Table 7-3. The fields of the zone descriptor

Type | Name | Description
char * | name | Contains a pointer to the conventional name of the zone: "DMA," "Normal," or "HighMem"
unsigned long | size | Number of pages in the zone
spinlock_t | lock | Spin lock protecting the descriptor
unsigned long | free_pages | Number of free pages in the zone
unsigned long | pages_min | Minimum number of pages of the zone that should remain free (see Section 16.7)
unsigned long | pages_low | Lower threshold value for the zone's page balancing algorithm (see Section 16.7)
unsigned long | pages_high | Upper threshold value for the zone's page balancing algorithm (see Section 16.7)
int | need_balance | Flag indicating that the zone's page balancing algorithm should be activated (see Section 16.7)
free_area_t [ ] | free_area | Used by the buddy system page allocator (see Section 7.1.7 later in this chapter)
struct pglist_data * | zone_pgdat | Pointer to the descriptor of the node to which this zone belongs
struct page * | zone_mem_map | Array of page descriptors of the zone (see Section 7.1.7 later in this chapter)
unsigned long | zone_start_paddr | First physical address of the zone
unsigned long | zone_start_mapnr | First page descriptor index of the zone

The zone field in the page descriptor points to the descriptor of the zone to which the corresponding page frame belongs. The zone_names array stores the canonical names of the three zones: "DMA," "Normal," and "HighMem."

When the kernel invokes a memory allocation function, it must specify the zones that contain the requested page frames. The kernel usually specifies which zones it's willing to use. For instance, if a page frame must be directly mapped in the fourth gigabyte of linear addresses but it is not going to be used for ISA DMA transfers, then the kernel requests a page frame either in ZONE_NORMAL or in ZONE_DMA. Of course, the page frame should be obtained from ZONE_DMA only if ZONE_NORMAL does not have free page frames. To specify the preferred zones in a memory allocation request, the kernel uses the struct zonelist_struct data structure (or equivalently zonelist_t), which is an array of zone descriptor pointers.

7.1.3 Non-Uniform Memory Access (NUMA)

We are used to thinking of the computer's memory as a homogeneous, shared resource. Disregarding the role of the hardware caches, we expect the time required for a CPU to access a memory location to be essentially the same, regardless of the location's physical address and the CPU. Unfortunately, this assumption is not true in some architectures. For instance, it is not true for some multiprocessor Alpha or MIPS computers.

Linux 2.4 supports the Non-Uniform Memory Access (NUMA) model, in which the access times for different memory locations from a given CPU may vary. The physical memory of the system is partitioned in several nodes. The time needed by any given CPU to access pages within a single node is the same. However, this time might not be the same for two different CPUs. For every CPU, the kernel tries to minimize the number of accesses to costly nodes by carefully selecting where the kernel data structures that are most often referenced by the CPU are stored.

The physical memory inside each node can be split in several zones, as we saw in the previous section. Each node has a descriptor of type pg_data_t, whose fields are shown in Table 7-4. All node descriptors are stored in a singly linked list, whose first element is pointed to by the pgdat_list variable.

Table 7-4. The fields of the node descriptor

Type | Name | Description
zone_t [ ] | node_zones | Array of zone descriptors of the node
zonelist_t [ ] | node_zonelists | Array of zonelist_t data structures used by the page allocator (see Section 7.1.5 later in this chapter)
int | nr_zones | Number of zones in the node
struct page * | node_mem_map | Array of page descriptors of the node
unsigned long * | valid_addr_bitmap | Bitmap of usable physical addresses for the node
struct bootmem_data * | bdata | Used in the kernel initialization phase
unsigned long | node_start_paddr | First physical address of the node
unsigned long | node_start_mapnr | First page descriptor index of the node
unsigned long | node_size | Size of the node (in pages)
int | node_id | Identifier of the node
pg_data_t * | node_next | Next item in the node list

As usual, we are mostly concerned with the 80x86 architecture. IBM-compatible PCs use the Uniform Memory Access (UMA) model, thus the NUMA support is not really required. However, even if NUMA support is not compiled in the kernel, Linux makes use of a single node that includes all system physical memory; the corresponding descriptor is stored in the contig_page_data variable.

On the 80x86 architecture, grouping the physical memory in a single node might appear useless; however, this approach makes the memory handling code more portable, because the kernel may assume that the physical memory is partitioned in one or more nodes in all architectures.[1]

[1] We have another example of this kind of design choice: Linux uses three levels of Page Tables even when the hardware architecture defines just two levels (see Section 2.5).

7.1.4 Initialization of the Memory Handling Data Structures

Dynamic memory and the values used to refer to it are illustrated in Figure 7-1. The zones of memory are not drawn to scale; ZONE_NORMAL is usually larger than ZONE_DMA, and, if present, ZONE_HIGHMEM is usually larger than ZONE_NORMAL. Notice that ZONE_HIGHMEM starts from physical address 0x38000000, which corresponds to 896 MB.

Figure 7-1. Memory layout

We already described how the paging_init( ) function initializes the kernel Page Tables according to the amount of RAM in the system in Section 2.5.5. Besides Page Tables, the paging_init( ) function also initializes other memory handling data structures. It invokes kmap_init( ), which essentially sets up the kmap_pte variable to create "windows" of linear addresses that allow the kernel to address the ZONE_HIGHMEM zone (see Section 7.1.6.2 later in this chapter). Then, paging_init( ) invokes the free_area_init( ) function, passing an array storing the sizes of the three memory zones to it.

The free_area_init( ) function sets up both the zone descriptors and the page descriptors. The function receives the zones_size array (size of each memory zone) as its parameter, and executes the following operations:[2]

[2] In NUMA architectures, these operations must be performed separately on every node. However, we are focusing on the 80x86 architecture, which has just one node.

1. Computes the total number of page frames in RAM by adding the values in zones_size, and stores the result in the totalpages local variable.

2. Initializes the active_list and inactive_list lists of page descriptors (see Chapter 16).

3. Allocates space for the mem_map array of page descriptors. The space needed is the product of totalpages and the page descriptor size.

4. Initializes some fields of the node descriptor contig_page_data:

    contig_page_data.node_size = totalpages;
    contig_page_data.node_start_paddr = 0x00000000;
    contig_page_data.node_start_mapnr = 0;

5. Initializes some fields of all page descriptors. All page frames are marked as reserved, but later, the PG_reserved flag of the page frames in dynamic memory will be cleared:

    for (p = mem_map; p < mem_map + totalpages; p++) {
        p->count = 0;
        SetPageReserved(p);
        init_waitqueue_head(&p->wait);
        p->list.next = p->list.prev = p;
    }

6. Stores the address of the memory zone descriptor in the zone local variable and, for each element of the zone_names array (index j between 0 and 2), performs the following steps:

   a. Initializes some fields of the descriptor:

        zone->name = zone_names[j];
        zone->size = zones_size[j];
        zone->lock = SPIN_LOCK_UNLOCKED;
        zone->zone_pgdat = & contig_page_data;
        zone->free_pages = 0;
        zone->need_balance = 0;

   b. If the zone is empty (that is, it does not include any page frame), the function goes back to the beginning of Step 6 and continues with the next zone.

   c. Otherwise, the zone includes at least one page frame and the function initializes the pages_min, pages_low, and pages_high fields of the zone descriptor (see Chapter 16).

   d. Sets up the zone_mem_map field of the zone descriptor to the address of the first page descriptor in the zone.

   e. Sets up the zone_start_mapnr field of the zone descriptor to the index of the first page descriptor in the zone.

   f. Sets up the zone_start_paddr field of the zone descriptor to the physical address of the first page frame in the zone.

   g. Stores the address of the zone descriptor in the zone field of the page descriptor for each page frame of the zone.

   h. If the zone is either ZONE_DMA or ZONE_NORMAL, stores the linear address in the fourth gigabyte that maps the page frame into the virtual field of every page descriptor of the zone.

   i. Initializes the free_area_t structures in the free_area array of the zone descriptor (see Section 7.1.7 later in this chapter).

7. Initializes the node_zonelists array of the contig_page_data node descriptor. The array includes 16 elements; each element corresponds to a different type of memory request and specifies the zones (in order of preference) from where the page frames could be retrieved. See Section 7.1.5 later in this chapter for further details.

When the paging_init( ) function terminates, dynamic memory is not yet usable because the PG_reserved flag of all pages is set. Memory initialization is further carried on by the mem_init( ) function, which is invoked after paging_init( ).

Essentially, the mem_init( ) function initializes the value of num_physpages, the total number of page frames present in the system. It then scans all page frames associated with the dynamic memory; for each of them, the function sets the count field of the corresponding descriptor to 1, resets the PG_reserved flag, sets the PG_highmem flag if the page belongs to ZONE_HIGHMEM, and calls the free_page( ) function on it. Besides releasing the page frame (see Section 7.1.7 later in this chapter), free_page( ) also increments the value of the free_pages field of the memory zone descriptor that owns the page frame. The free_pages fields of all zone descriptors are used by the nr_free_pages( ) function to compute the total number of free page frames in the dynamic memory.

The mem_init( ) function also counts the number of page frames that are not associated with dynamic memory. Several symbols produced while compiling the kernel (some are described in Section 2.5.3) enable the function to count the number of page frames reserved for the hardware, kernel code, and kernel data, and the number of page frames used during kernel initialization that can be subsequently released.
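Steps 1 and 6 (d through f) of free_area_init( ) amount to a running sum over the zones_size array. The following C sketch shows that bookkeeping under simplifying assumptions: a single node that starts at physical address 0 and has no holes, a reduced descriptor called zone_sketch, and a helper named init_zone_starts( ). These names are introduced here for illustration only.

```c
#include <assert.h>

#define PAGE_SHIFT   12   /* 4 KB page frames */
#define MAX_NR_ZONES 3    /* DMA, Normal, HighMem */

/* Reduced zone descriptor: only the fields computed in this sketch. */
struct zone_sketch {
    unsigned long size;                   /* pages in the zone         */
    unsigned long zone_start_mapnr;       /* first descriptor index    */
    unsigned long long zone_start_paddr;  /* first physical address    */
};

/* Fills the start fields of each nonempty zone and returns the total
 * number of page frames (the totalpages value of Step 1). */
static unsigned long init_zone_starts(struct zone_sketch zones[MAX_NR_ZONES],
                                      const unsigned long zones_size[MAX_NR_ZONES])
{
    unsigned long totalpages = 0;
    int j;

    for (j = 0; j < MAX_NR_ZONES; j++) {
        zones[j].size = zones_size[j];
        zones[j].zone_start_mapnr = 0;
        zones[j].zone_start_paddr = 0;
        if (zones_size[j] == 0)
            continue;                     /* empty zone: skip, as in Step 6b */
        zones[j].zone_start_mapnr = totalpages;
        zones[j].zone_start_paddr =
            (unsigned long long)totalpages << PAGE_SHIFT;
        totalpages += zones_size[j];
    }
    assert(totalpages > 0);               /* a node with no RAM is meaningless */
    return totalpages;
}
```

For a machine with 1 GB of RAM, the three zone sizes are 4,096, 225,280, and 32,768 page frames, and the third zone starts at physical address 0x38000000, in agreement with Figure 7-1.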

7.1.5 Requesting and Releasing Page Frames

After having seen how the kernel allocates and initializes the data structures for page frame handling, we now look at how page frames are allocated and released. Page frames can be requested by using six slightly different functions and macros. Unless otherwise stated, they return the linear address of the first allocated page, or return NULL if the allocation failed.

alloc_pages(gfp_mask, order)
Function used to request 2^order contiguous page frames. It returns the address of the descriptor of the first allocated page frame or returns NULL if the allocation failed.

alloc_page(gfp_mask)
Macro used to get a single page frame; it expands to:

    alloc_pages(gfp_mask, 0)

It returns the address of the descriptor of the allocated page frame or returns NULL if the allocation failed.

__get_free_pages(gfp_mask, order)
Function that is similar to alloc_pages( ), but it returns the linear address of the first allocated page.

__get_free_page(gfp_mask)
Macro used to get a single page frame; it expands to:

    __get_free_pages(gfp_mask, 0)

get_zeroed_page(gfp_mask), or equivalently get_free_page(gfp_mask)
Function that invokes:

    alloc_pages(gfp_mask, 0)

and then fills the page frame obtained with zeros.

__get_dma_pages(gfp_mask, order)
Macro used to get page frames suitable for DMA; it expands to:

    __get_free_pages(gfp_mask | __GFP_DMA, order)
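Because the allocation functions accept only an order, a caller that needs a given number of contiguous page frames must round the request up to a power of two. The following C sketch shows one way to compute the smallest suitable order; the helper order_for_pages( ) and the limit SKETCH_MAX_PAGES are assumptions made for this example and are not kernel identifiers.

```c
#include <assert.h>

/* Arbitrary cap used only to keep the shift below well defined. */
#define SKETCH_MAX_PAGES (1UL << 16)

/* Hypothetical helper: smallest order such that 2^order >= nr_pages. */
static unsigned int order_for_pages(unsigned long nr_pages)
{
    unsigned int order = 0;

    assert(nr_pages >= 1 && nr_pages <= SKETCH_MAX_PAGES);
    while ((1UL << order) < nr_pages)
        order++;
    return order;
}
```

A request for three page frames therefore becomes a request of order 2, and one of the four page frames obtained is unused; this rounding is a cost of allocating in power-of-two groups.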

The parameter gfp_mask specifies how to look for free page frames. It consists of the following flags:

__GFP_WAIT
The kernel is allowed to block the current process waiting for free page frames.

__GFP_HIGH
The kernel is allowed to access the pool of free page frames left for recovering from very low memory conditions.

__GFP_IO
The kernel is allowed to perform I/O transfers on low memory pages in order to free page frames.

__GFP_HIGHIO
The kernel is allowed to perform I/O transfers on high memory pages in order to free page frames.

__GFP_FS
The kernel is allowed to perform low-level VFS operations.

__GFP_DMA
The requested page frames must be included in the ZONE_DMA zone (see the earlier Section 7.1.2).

__GFP_HIGHMEM
The requested page frames can be included in the ZONE_HIGHMEM zone.

In practice, Linux uses the predefined combinations of flag values shown in Table 7-5; the group name is what you'll encounter as argument of the six page frame allocation functions.

Table 7-5. Groups of flag values used to request page frames

Group name | Corresponding flags
GFP_ATOMIC | __GFP_HIGH
GFP_NOIO | __GFP_HIGH __GFP_WAIT
GFP_NOHIGHIO | __GFP_HIGH __GFP_WAIT __GFP_IO
GFP_NOFS | __GFP_HIGH __GFP_WAIT __GFP_IO __GFP_HIGHIO
GFP_KERNEL | __GFP_HIGH __GFP_WAIT __GFP_IO __GFP_HIGHIO __GFP_FS
GFP_NFS | __GFP_HIGH __GFP_WAIT __GFP_IO __GFP_HIGHIO __GFP_FS
GFP_KSWAPD | __GFP_WAIT __GFP_IO __GFP_HIGHIO __GFP_FS
GFP_USER | __GFP_WAIT __GFP_IO __GFP_HIGHIO __GFP_FS
GFP_HIGHUSER | __GFP_WAIT __GFP_IO __GFP_HIGHIO __GFP_FS __GFP_HIGHMEM
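The groups in Table 7-5 are bitwise ORs of the individual flags. The following C sketch models a few of them; the GFP_BIT_ names replace the double-underscore identifiers (which are reserved in ordinary C programs), the numeric bit values are illustrative rather than quoted from the kernel headers, and may_block( ) is a hypothetical helper.

```c
#include <assert.h>

/* Individual flags; bit values are illustrative assumptions. */
enum {
    GFP_BIT_DMA     = 0x001,
    GFP_BIT_HIGHMEM = 0x002,
    GFP_BIT_WAIT    = 0x010,
    GFP_BIT_HIGH    = 0x020,
    GFP_BIT_IO      = 0x040,
    GFP_BIT_HIGHIO  = 0x080,
    GFP_BIT_FS      = 0x100
};

/* Some of the groups of Table 7-5, built from the flags above. */
#define GFP_ATOMIC   (GFP_BIT_HIGH)
#define GFP_USER     (GFP_BIT_WAIT | GFP_BIT_IO | GFP_BIT_HIGHIO | GFP_BIT_FS)
#define GFP_KERNEL   (GFP_BIT_HIGH | GFP_USER)
#define GFP_HIGHUSER (GFP_USER | GFP_BIT_HIGHMEM)

/* Hypothetical helper: nonzero if the request may put the caller to sleep. */
static int may_block(unsigned int gfp_mask)
{
    return (gfp_mask & GFP_BIT_WAIT) != 0;
}

/* Usage example: atomic requests never block; ordinary kernel ones may. */
static void gfp_example(void)
{
    assert(!may_block(GFP_ATOMIC));
    assert(may_block(GFP_KERNEL));
}
```

The table shows why GFP_ATOMIC suits interrupt handlers: it lacks __GFP_WAIT, so the allocator cannot suspend the caller, and it includes __GFP_HIGH, so the reserved pool can be used instead.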

The __GFP_DMA and __GFP_HIGHMEM flags are called zone modifiers; they specify the zones searched by the kernel while looking for free page frames. The node_zonelists field of the contig_page_data node descriptor is an array of lists of zone descriptors; each list is associated with one specific combination of the zone modifiers. Although the array includes 16 elements, only 4 are really used, since there are currently only 2 zone modifiers. They are shown in Table 7-6.

Table 7-6. Zone modifier lists

__GFP_DMA | __GFP_HIGHMEM | Zone list
0 | 0 | ZONE_NORMAL + ZONE_DMA
0 | 1 | ZONE_HIGHMEM + ZONE_NORMAL + ZONE_DMA
1 | 0 | ZONE_DMA
1 | 1 | ZONE_DMA
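Table 7-6 can be read as a small decision procedure: __GFP_DMA restricts the search to ZONE_DMA regardless of the other modifier; otherwise the list starts from the most permissive zone and falls back toward ZONE_DMA. The following C sketch encodes that reading; build_zonelist( ), the ZMOD_ constants, and the enumeration are names assumed for the example.

```c
#include <assert.h>

enum zone_kind { ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM };

/* Stand-ins for the two zone modifiers; values are illustrative. */
#define ZMOD_DMA     0x01u
#define ZMOD_HIGHMEM 0x02u

/* Writes the zones to be searched, in order of preference, into out[]
 * and returns how many entries were written (at most 3). */
static int build_zonelist(unsigned int modifiers, enum zone_kind out[3])
{
    int n = 0;

    if (modifiers & ZMOD_DMA) {
        out[n++] = ZONE_DMA;           /* DMA requests never fall back */
        return n;
    }
    if (modifiers & ZMOD_HIGHMEM)
        out[n++] = ZONE_HIGHMEM;       /* preferred when permitted     */
    out[n++] = ZONE_NORMAL;
    out[n++] = ZONE_DMA;               /* last resort                  */
    assert(n <= 3);
    return n;
}
```

Placing ZONE_DMA last preserves the scarce memory below 16 MB for the requests that cannot be satisfied anywhere else.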

Page frames can be released through each of the following four functions and macros:

__free_pages(page, order)
This function checks the page descriptor pointed to by page; if the page frame is not reserved (i.e., if the PG_reserved flag is equal to 0), it decrements the count field of the descriptor. If count becomes 0, it assumes that 2^order contiguous page frames starting from the one described by page are no longer used. In this case, the function invokes __free_pages_ok( ) to insert the page frame descriptor of the first free page in the proper list of free page frames (described in the following section).

free_pages(addr, order)
This function is similar to __free_pages( ), but it receives as an argument the linear address addr of the first page frame to be released.

__free_page(page)
This macro releases the page frame having the descriptor pointed to by page; it expands to:

    __free_pages(page, 0)

free_page(addr)
This macro releases the page frame having the linear address addr; it expands to:

    free_pages(addr, 0)
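The release logic of __free_pages( ) is a reference-count decrement guarded by the reserved flag. The following C sketch models only that decision; the structure page_sketch and the function put_page_sketch( ) are hypothetical, and a plain integer replaces the atomic counter that the page descriptor actually uses.

```c
#include <assert.h>

/* Reduced page descriptor: usage counter and reserved flag only. */
struct page_sketch {
    int count;
    int reserved;
};

/* Drops one reference. Returns 1 when the last reference is gone and
 * the page frame should be handed back to the free lists, 0 otherwise. */
static int put_page_sketch(struct page_sketch *page)
{
    if (page->reserved)
        return 0;                 /* reserved page frames are never freed */
    assert(page->count > 0);      /* releasing a free page is a bug       */
    page->count--;
    return page->count == 0;
}
```

A page frame shared by two users is thus released only by the second call; the first call merely lowers the counter.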

7.1.6 Kernel Mappings of High-Memory Page Frames

Page frames above the 896 MB boundary are not mapped in the fourth gigabyte of the kernel linear address space, so they cannot be directly accessed by the kernel. This implies that any page allocator function that returns the linear address of the assigned page frame doesn't work for the high memory.

For instance, suppose that the kernel invoked __get_free_pages(GFP_HIGHMEM,0) to allocate a page frame in high memory. If the allocator assigned a page frame in high memory, __get_free_pages( ) cannot return its linear address because it doesn't exist; thus, the function returns NULL. In turn, the kernel cannot use the page frame; even worse, the page frame cannot be released because the kernel has lost track of it.

In short, allocation of high-memory page frames must be done only through the alloc_pages( ) function and its alloc_page( ) shortcut, which both return the address of the page descriptor of the first allocated page frame. Once allocated, a high-memory page frame has to be mapped into the fourth gigabyte of the linear address space, even though the physical address of the page frame may well exceed 4 GB.

To do this, the kernel may use three different mechanisms, which are called permanent kernel mappings, temporary kernel mappings, and noncontiguous memory allocation. In this section, we focus on the first two techniques; the third one is discussed in Section 7.3 later in this chapter.

Establishing a permanent kernel mapping may block the current process; this happens when no free Page Table entries exist that can be used as "windows" on the page frames in high memory (see the next section). Thus, a permanent kernel mapping cannot be used in interrupt handlers and deferrable functions. Conversely, establishing a temporary kernel mapping never requires blocking the current process; its drawback, however, is that very few temporary kernel mappings can be established at the same time.

Of course, none of these techniques allows addressing the whole RAM simultaneously. After all, only 128 MB of linear address space is left for mapping the high memory, while PAE supports systems having up to 64 GB of RAM.

7.1.6.1 Permanent kernel mappings

Permanent kernel mappings allow the kernel to establish long-lasting mappings of high-memory page frames into the kernel address space. They use a dedicated Page Table whose address is stored in the pkmap_page_table variable. The number of entries in the Page Table is yielded by the LAST_PKMAP macro. As usual, the Page Table includes either 512 or 1,024 entries, according to whether PAE is enabled or disabled (see Section 2.4.6); thus, the kernel can access at most 2 or 4 MB of high memory at once.

The Page Table maps the linear addresses starting from PKMAP_BASE (usually 0xfe000000). The address of the descriptor corresponding to the first page frame in high memory is stored in the highmem_start_page variable.

The pkmap_count array includes LAST_PKMAP counters, one for each entry of the pkmap_page_table Page Table. We distinguish three cases:

The counter is 0
    The corresponding Page Table entry does not map any high-memory page frame and is usable.

The counter is 1
    The corresponding Page Table entry does not map any high-memory page frame, but it cannot be used because the corresponding TLB entry has not been flushed since its last usage.

The counter is n (greater than 1)
    The corresponding Page Table entry maps a high-memory page frame, which is used by exactly n-1 kernel components.

The kmap( ) function establishes a permanent kernel mapping. It is essentially equivalent to the following code:

    void * kmap(struct page * page)
    {
        /* page frames below high memory are always mapped */
        if (page < highmem_start_page)
            return page->virtual;
        return kmap_high(page);
    }

The virtual field of the page descriptor stores the linear address in the fourth gigabyte mapping the page frame, if any. Thus, for any page frame below the 896 MB boundary, the field always stores the physical address of the page frame plus PAGE_OFFSET. Conversely, if the page frame is in high memory, the virtual field has a non-null value only if the page frame is currently mapped, either by the permanent or the temporary kernel mapping.

The kmap_high( ) function is invoked if the page frame really belongs to the high memory. The function is essentially equivalent to the following code:

    void * kmap_high(struct page * page)
    {
        unsigned long vaddr;
        spin_lock(&kmap_lock);
        vaddr = (unsigned long) page->virtual;
        if (!vaddr)
            vaddr = map_new_virtual(page);
        pkmap_count[(vaddr-PKMAP_BASE) >> PAGE_SHIFT]++;
        spin_unlock(&kmap_lock);
        return (void *) vaddr;
    }

The function gets the kmap_lock spin lock to protect the Page Table against concurrent accesses in multiprocessor systems. Notice that there is no need to disable the interrupts because kmap( ) cannot be invoked by interrupt handlers and deferrable functions. Next, the kmap_high( ) function checks whether the virtual field of the page descriptor already stores a non-null linear address. If not, the function invokes the map_new_virtual( ) function to insert the page frame physical address in an entry of pkmap_page_table. Then kmap_high( ) increments the counter corresponding to the linear address of the page frame by 1 because another kernel component is going to access the page frame. Finally, kmap_high( ) releases the kmap_lock spin lock and returns the linear address that maps the page.

The map_new_virtual( ) function essentially executes two nested loops:

    for (;;) {
        int count;
        DECLARE_WAITQUEUE(wait, current);
        /* count > 0: give up after one full scan that follows a flush */
        for (count = LAST_PKMAP; count > 0; --count) {
            last_pkmap_nr = (last_pkmap_nr + 1) & (LAST_PKMAP - 1);
            if (!last_pkmap_nr) {
                flush_all_zero_pkmaps( );
                count = LAST_PKMAP;
            }
            if (!pkmap_count[last_pkmap_nr]) {
                unsigned long vaddr = PKMAP_BASE + (last_pkmap_nr << PAGE_SHIFT);
                set_pte(&(pkmap_page_table[last_pkmap_nr]), mk_pte(page, 0x63));
                pkmap_count[last_pkmap_nr] = 1;
                page->virtual = (void *) vaddr;
                return vaddr;
            }
        }
        /* no usable entry: sleep until another process releases one */
        current->state = TASK_UNINTERRUPTIBLE;
        add_wait_queue(&pkmap_map_wait, &wait);
        spin_unlock(&kmap_lock);
        schedule( );
        remove_wait_queue(&pkmap_map_wait, &wait);
        spin_lock(&kmap_lock);
        if (page->virtual)
            return (unsigned long) page->virtual;
    }

In the inner loop, the function scans all counters in pkmap_count looking for a null value. The last_pkmap_nr variable stores the index of the last used entry in the pkmap_page_table Page Table. Thus, the search starts from where it was left in the last invocation of the map_new_virtual( ) function.

When the last counter in pkmap_count is reached, the search restarts from the counter at index 0. Before continuing, however, map_new_virtual( ) invokes the flush_all_zero_pkmaps( ) function, which starts another scanning of the counters looking for the value 1. Each counter that has value 1 denotes an entry in pkmap_page_table that is free but cannot be used because the corresponding TLB entry has not yet been flushed. flush_all_zero_pkmaps( ) issues the TLB flushes on such entries and resets their counters to zero.

If the inner loop cannot find a null counter in pkmap_count, the map_new_virtual( ) function blocks the current process until some other process releases an entry of the pkmap_page_table Page Table. This is achieved by inserting current in the pkmap_map_wait wait queue, setting the current state to TASK_UNINTERRUPTIBLE and invoking schedule( ) to relinquish the CPU. Once the process is awoken, the function checks whether another process has mapped the page by looking at the virtual field of the page descriptor; if some other process has mapped the page, the function returns the linear address stored in that field; otherwise, the inner loop is restarted.

When a null counter is found by the inner loop, the map_new_virtual( ) function:

1. Computes the linear address that corresponds to the counter.

2. Writes the page's physical address into the entry in pkmap_page_table. The function also sets the bits Accessed, Dirty, Read/Write, and Present (value 0x63) in the same entry.

3. Sets to 1 the pkmap_count counter.

4. Writes the linear address into the virtual field of the page descriptor.

5. Returns the linear address.

The kunmap( ) function destroys a permanent kernel mapping. If the page is really in the high memory zone, it invokes the kunmap_high( ) function, which is essentially equivalent to the following code:

    void kunmap_high(struct page * page)
    {
        spin_lock(&kmap_lock);
        if ((--pkmap_count[((unsigned long) page->virtual-PKMAP_BASE)>>PAGE_SHIFT])==1)
            wake_up(&pkmap_map_wait);
        spin_unlock(&kmap_lock);
    }

Notice that if the counter of the Page Table entry becomes equal to 1 (free), kunmap_high( ) wakes up the processes waiting in the pkmap_map_wait wait queue.
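The life cycle of a pkmap_count counter (0 = usable, 1 = free but not yet flushed, n > 1 = mapped with n-1 users) can be explored with a small user-space model. This is only a sketch under stated assumptions: the names model_map( ), model_unmap( ), and model_flush( ), and the four-entry table size MODEL_LAST_PKMAP, are hypothetical; there is no locking, no sleeping, and no real Page Table or TLB.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, tiny stand-in for LAST_PKMAP (must be a power of two). */
#define MODEL_LAST_PKMAP 4

static int model_count[MODEL_LAST_PKMAP];  /* plays the role of pkmap_count */
static unsigned int model_last_nr;         /* plays the role of last_pkmap_nr */

/* Counters equal to 1 (free, TLB entry not flushed) become 0 (usable). */
static void model_flush(void)
{
    for (size_t i = 0; i < MODEL_LAST_PKMAP; i++)
        if (model_count[i] == 1)
            model_count[i] = 0;
}

/* Scan for a usable entry the way the inner loop of map_new_virtual( )
 * does, then account for one user as kmap_high( ) would (1 + 1 = 2).
 * Returns the entry index, or -1 where the real code would sleep. */
static int model_map(void)
{
    for (int count = MODEL_LAST_PKMAP; count > 0; --count) {
        model_last_nr = (model_last_nr + 1) & (MODEL_LAST_PKMAP - 1);
        if (!model_last_nr) {
            model_flush();
            count = MODEL_LAST_PKMAP;
        }
        if (!model_count[model_last_nr]) {
            model_count[model_last_nr] = 2;
            return (int) model_last_nr;
        }
    }
    return -1;
}

/* Drop one user; a counter that falls to 1 is free but stays unusable
 * until the next model_flush( ). */
static void model_unmap(int nr)
{
    assert(nr >= 0 && nr < MODEL_LAST_PKMAP);
    assert(model_count[nr] >= 2);
    model_count[nr]--;
}
```

In this model an entry released by model_unmap( ) is not handed out again until the scan wraps around to index 0 and triggers a flush, which mirrors why the value 1 is kept distinct from 0.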

7.1.6.2 Temporary kernel mappings

Temporary kernel mappings are simpler to implement than permanent kernel mappings; moreover, they can be used inside interrupt handlers and deferrable functions because they never block the current process.

Any page frame in high memory can be mapped through a window in the kernel address space, namely, a Page Table entry that is reserved for this purpose. The number of windows reserved for temporary kernel mappings is quite small.

Each CPU has its own set of five windows whose linear addresses are identified by the enum km_type data structure:

    enum km_type {
        KM_BOUNCE_READ,
        KM_SKB_DATA,
        KM_SKB_DATA_SOFTIRQ,
        KM_USER0,
        KM_USER1,
        KM_TYPE_NR
    };

The kernel must ensure that the same window is never used by two kernel control paths at the same time. Thus, each symbol is named after the kernel component that is allowed to use the corresponding window. The last symbol, KM_TYPE_NR, does not represent a linear address by itself, but yields the number of different windows usable by every CPU.

Each symbol in km_type, except the last one, is an index of a fix-mapped linear address (see Section 2.5.6). The enum fixed_addresses data structure includes the symbols FIX_KMAP_BEGIN and FIX_KMAP_END; the latter is assigned to the index FIX_KMAP_BEGIN+(KM_TYPE_NR*NR_CPUS)-1. In this manner, there are KM_TYPE_NR fix-mapped linear addresses for each CPU in the system. Furthermore, the kernel initializes the kmap_pte variable with the address of the Page Table entry corresponding to the fix_to_virt(FIX_KMAP_BEGIN) linear address.

To establish a temporary kernel mapping, the kernel invokes the kmap_atomic( ) function, which is essentially equivalent to the following code:

    void * kmap_atomic(struct page * page, enum km_type type)
    {
        enum fixed_addresses idx;
        if (page < highmem_start_page)
            return page->virtual;
        /* select the window of this type that belongs to the running CPU */
        idx = type + KM_TYPE_NR * smp_processor_id( );
        set_pte(kmap_pte-idx, mk_pte(page, 0x063));
        __flush_tlb_one(fix_to_virt(FIX_KMAP_BEGIN+idx));
        return (void *) fix_to_virt(FIX_KMAP_BEGIN+idx);
    }

The type argument and the CPU identifier specify what fix-mapped linear address has to be used to map the requested page. The function returns the linear address of the page frame if it doesn't belong to high memory; otherwise, it sets up the Page Table entry corresponding to the fix-mapped linear address with the page's physical address and the bits Present, Accessed, Read/Write, and Dirty. Finally, the TLB entry corresponding to the linear address is flushed, and the fix-mapped linear address is returned.

To destroy a temporary kernel mapping, the kernel uses the kunmap_atomic( ) function. In the 80x86 architecture, however, this function does nothing.

Temporary kernel mappings should be used carefully. A kernel control path using a temporary kernel mapping must never block, because another kernel control path might use the same window to map some other high memory page.
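The way a window is selected from the type argument and the CPU identifier is plain index arithmetic, and it can be checked in isolation. The following user-space sketch repeats the enum from the text; the helper names kmap_window_index( ) and kmap_window_total( ) are hypothetical and exist only for illustration.

```c
#include <assert.h>

/* Same symbols as the enum km_type shown in the text. */
enum km_type {
    KM_BOUNCE_READ,
    KM_SKB_DATA,
    KM_SKB_DATA_SOFTIRQ,
    KM_USER0,
    KM_USER1,
    KM_TYPE_NR
};

/* Index of the window, relative to FIX_KMAP_BEGIN, used by the given CPU
 * for the given type: the same expression that kmap_atomic( ) assigns
 * to idx. */
static unsigned int kmap_window_index(enum km_type type, unsigned int cpu)
{
    assert(type < KM_TYPE_NR);
    return (unsigned int) type + KM_TYPE_NR * cpu;
}

/* Number of fix-mapped slots between FIX_KMAP_BEGIN and FIX_KMAP_END
 * (both included) for nr_cpus processors. */
static unsigned int kmap_window_total(unsigned int nr_cpus)
{
    return KM_TYPE_NR * nr_cpus;
}
```

Because every CPU owns a disjoint range of KM_TYPE_NR indexes, two CPUs never share a window; collisions can only happen between kernel control paths running on the same CPU and using the same type.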

7.1.7 The Buddy System Algorithm

The kernel must establish a robust and efficient strategy for allocating groups of contiguous page frames. In doing so, it must deal with a well-known memory management problem called external fragmentation: frequent requests and releases of groups of contiguous page frames of different sizes may lead to a situation in which several small blocks of free page frames are "scattered" inside blocks of allocated page frames. As a result, it may become impossible to allocate a large block of contiguous page frames, even if there are enough free pages to satisfy the request.

There are essentially two ways to avoid external fragmentation:

● Use the paging circuitry to map groups of noncontiguous free page frames into intervals of contiguous linear addresses.

● Develop a suitable technique to keep track of the existing blocks of free contiguous page frames, avoiding as much as possible the need to split up a large free block to satisfy a request for a smaller one.

The second approach is preferred by the kernel for three good reasons:

● In some cases, contiguous page frames are really necessary, since contiguous linear addresses are not sufficient to satisfy the request. A typical example is a memory request for buffers to be assigned to a DMA processor (see Chapter 13). Since the DMA ignores the paging circuitry and accesses the address bus directly while transferring several disk sectors in a single I/O operation, the buffers requested must be located in contiguous page frames.

● Even if contiguous page frame allocation is not strictly necessary, it offers the big advantage of leaving the kernel paging tables unchanged. What's wrong with modifying the Page Tables? As we know from Chapter 2, frequent Page Table modifications lead to higher average memory access times, since they make the CPU flush the contents of the translation lookaside buffers.

● Large chunks of contiguous physical memory can be accessed by the kernel through 4 MB pages. The reduction of translation lookaside buffer misses, in comparison to the use of 4 KB pages, significantly speeds up the average memory access time (see Section 2.4.8).

The technique adopted by Linux to solve the external fragmentation problem is based on the well-known buddy system algorithm. All free page frames are grouped into 10 lists of blocks that contain groups of 1, 2, 4, 8, 16, 32, 64, 128, 256, and 512 contiguous page frames, respectively. The physical address of the first page frame of a block is a multiple of the group size; for example, the initial address of a 16-page-frame block is a multiple of 16 x 2^12 (2^12 = 4,096, which is the regular page size).

We'll show how the algorithm works through a simple example. Assume there is a request for a group of 128 contiguous page frames (i.e., a half-megabyte). The algorithm checks first to see whether a free block in the 128-page-frame list exists. If there is no such block, the algorithm looks for the next larger block, that is, a free block in the 256-page-frame list. If such a block exists, the kernel allocates 128 of the 256 page frames to satisfy the request and inserts the remaining 128 page frames into the list of free 128-page-frame blocks. If there is no free 256-page block, the kernel then looks for the next larger block (i.e., a free 512-page-frame block). If such a block exists, it allocates 128 of the 512 page frames to satisfy the request, inserts the first 256 of the remaining 384 page frames into the list of free 256-page-frame blocks, and inserts the last 128 of the remaining 384 page frames into the list of free 128-page-frame blocks. If the list of 512-page-frame blocks is empty, the algorithm gives up and signals an error condition.

The reverse operation, releasing blocks of page frames, gives rise to the name of this algorithm. The kernel attempts to merge pairs of free buddy blocks of size b together into a single block of size 2b. Two blocks are considered buddies if:

● Both blocks have the same size, say b.

● They are located in contiguous physical addresses.

● The physical address of the first page frame of the first block is a multiple of 2 x b x 2^12.

The algorithm is iterative; if it succeeds in merging released blocks, it doubles b and tries again so as to create even bigger blocks.

7.1.7.1 Data structures

Linux uses a different buddy system for each zone. Thus, in the 80x86 architecture, there are three buddy systems: the first handles the page frames suitable for ISA DMA, the second handles the "normal" page frames, and the third handles the high-memory page frames. Each buddy system relies on the following main data structures:

● The mem_map array introduced previously. Actually, each zone is concerned with a subset of the mem_map elements. The first element in the subset and its number of elements are specified, respectively, by the zone_mem_map and size fields of the zone descriptor.

● An array having 10 elements of type free_area_t, one element for each group size. The array is stored in the free_area field of the zone descriptor.

● Ten binary arrays named bitmaps, one for each group size. Each buddy system has its own set of bitmaps, which it uses to keep track of the blocks it allocates.

The free_area_t (or equivalently, struct free_area_struct) data structure is defined as follows:

    typedef struct free_area_struct {
        struct list_head free_list;
        unsigned long *map;
    } free_area_t;

The k-th element of the free_area array in the zone descriptor is associated with a doubly linked circular list of blocks of size 2^k; each member of such a list is the descriptor of the first page frame of a block. The list is implemented through the list field of the page descriptor.

The map field points to a bitmap whose size depends on the number of page frames in the zone. Each bit of the bitmap of the k-th entry of the free_area array describes the status of two buddy blocks of size 2^k page frames. If a bit of the bitmap is equal to 0, either both buddy blocks of the pair are free or both are busy; if it is equal to 1, exactly one of the blocks is busy. When both buddies are free, the kernel treats them as a single free block of size 2^(k+1).

Let's consider, for sake of illustration, a zone including 128 MB of RAM. The 128 MB can be divided into 32,768 single pages, 16,384 groups of 2 pages each, or 8,192 groups of 4 pages each, and so on up to 64 groups of 512 pages each. So the bitmap corresponding to free_area[0] consists of 16,384 bits, one for each pair of the 32,768 existing page frames; the bitmap corresponding to free_area[1] consists of 8,192 bits, one for each pair of blocks of two consecutive page frames; the last bitmap corresponding to free_area[9] consists of 32 bits, one for each pair of blocks of 512 contiguous page frames.
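The bitmap sizes quoted for the 128 MB zone follow from one shift per order. The short user-space sketch below reproduces only that arithmetic; zone_pages( ) and bitmap_bits( ) are hypothetical helper names, and any rounding that a real allocator may apply when sizing the bitmaps is ignored.

```c
#include <assert.h>

#define PAGE_SHIFT 12   /* 4 KB pages */
#define NR_ORDERS  10   /* block sizes 2^0 .. 2^9, as in the text */

/* Number of page frames in a zone of zone_bytes bytes. */
static unsigned long zone_pages(unsigned long zone_bytes)
{
    return zone_bytes >> PAGE_SHIFT;
}

/* Number of bits in the bitmap of free_area[order]: one bit for each
 * pair of buddy blocks of 2^order page frames. */
static unsigned long bitmap_bits(unsigned long pages, unsigned int order)
{
    assert(order < NR_ORDERS);
    return pages >> (order + 1);
}
```

With pages = zone_pages(128UL << 20), bitmap_bits( ) yields 16,384 for order 0, 8,192 for order 1, and 32 for order 9, matching the figures above.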

Figure 7-2 illustrates the use of the data structures introduced by the buddy system algorithm. The array zone_mem_map contains nine free page frames grouped in one block of one (a single page frame) at the top and two blocks of four further down. The double arrows denote doubly linked circular lists implemented by the free_list field. Notice that the bitmaps are not drawn to scale.

Figure 7-2. Data structures used by the buddy system

7.1.7.2 Allocating a block

The alloc_pages( ) function is the core of the buddy system allocation algorithm. Any other allocator function or macro listed in the earlier Section 7.1.5 ends up invoking alloc_pages( ).

The function considers the list of the contig_page_data.node_zonelists array corresponding to the zone modifiers specified in the argument gfp_mask. Starting with the first zone descriptor in the list, it compares the number of free page frames in the zone (stored in the free_pages field of the zone descriptor), the number of requested page frames (argument order of alloc_pages( )), and the threshold value stored in the pages_low field of the zone descriptor. If free_pages - 2^order is smaller than or equal to pages_low, the function skips the zone and considers the next zone in the list. If no zone has enough free page frames, alloc_pages( ) restarts the loop, this time looking for a zone that has at least pages_min free page frames. If such a zone doesn't exist and if the current process is allowed to wait, the function invokes balance_classzone( ), which in turn invokes try_to_free_pages( ) to reclaim enough page frames to satisfy the memory request (see Section 16.7).

When alloc_pages( ) finds a zone with a suitable number of free page frames, it invokes the rmqueue( ) function to allocate a block in that zone. The function takes two arguments: the address of the zone descriptor, and order, which denotes the logarithm of the size of the requested block of free pages (0 for a one-page block, 1 for a two-page block, and so forth). If the page frames are successfully allocated, the rmqueue( ) function returns the address of the page descriptor of the first allocated page frame; that address is also returned by alloc_pages( ). Otherwise, rmqueue( ) returns NULL, and alloc_pages( ) considers the next zone in the list.

The rmqueue( ) function is equivalent to the following fragments. First, a few local variables are declared and initialized:

    free_area_t * area = &(zone->free_area[order]);
    unsigned int curr_order = order;
    struct list_head *head, *curr;
    struct page *page;
    unsigned long flags;
    unsigned int index;
    unsigned long size;

The function disables interrupts and acquires the spin lock of the zone descriptor because it will alter its fields to allocate a block. Then it performs a cyclic search through each list for an available block (denoted by an entry that doesn't point to the entry itself), starting with the list for the requested order and continuing if necessary to larger orders. This is equivalent to the following fragment:

    spin_lock_irqsave(&zone->lock, flags);
    do {
        head = &area->free_list;
        curr = head->next;
        if (curr != head)
            goto block_found;
        curr_order++;
        area++;
    } while (curr_order < 10);
    spin_unlock_irqrestore(&zone->lock, flags);
    return NULL;

If the loop terminates, no suitable free block has been found, so rmqueue( ) returns a NULL value. Otherwise, a suitable free block has been found; in this case, the descriptor of its first page frame is removed from the list, the corresponding bitmap is updated, and the value of free_pages in the zone descriptor is decreased:

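    block_found:
        page = list_entry(curr, struct page, list);
        list_del(curr);
        index = page - zone->zone_mem_map;
        if (curr_order != 9)
            change_bit(index>>(1+curr_order), area->map);
        zone->free_pages -= 1UL << order;

If the block found comes from a list of order curr_order greater than the requested order, it must be split. A while cycle repeatedly halves the block: the first half is inserted in the free list of the next smaller order, and the cycle continues with the second half, so that the last 2^order page frames of the block are the ones allocated:

        size = 1 << curr_order;
        while (curr_order > order) {
            area--;
            curr_order--;
            size >>= 1;
            /* insert *page as first element in the list and update the bitmap */
            list_add(&page->list, &area->free_list);
            change_bit(index >> (1+curr_order), area->map);
            /* now take care of the second half of the free block starting at *page */
            index += size;
            page += size;
        }

Finally, rmqueue( ) releases the spin lock, updates the count field of the page descriptor associated with the selected block, and executes the return instruction: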

    spin_unlock_irqrestore(&zone->lock, flags);
    atomic_set(&page->count, 1);
    return page;

As a result, the alloc_pages( ) function returns the address of the page descriptor of the first page frame allocated.
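The splitting of a larger block can be traced in user space by following offsets instead of page descriptors. The sketch below is an illustration under stated assumptions: struct split_result and split_block( ) are hypothetical names, offsets are expressed in page frames relative to the start of the block that was found, and no lists or bitmaps are maintained.

```c
#include <assert.h>

#define MAX_ORDER 10

struct split_result {
    /* free_offset[k]: offset of the piece of 2^k page frames that goes
     * back to the order-k free list, or -1 if none does */
    long free_offset[MAX_ORDER];
    /* offset of the 2^order page frames handed to the caller */
    unsigned long allocated_offset;
};

/* Split a free block of 2^curr_order page frames to satisfy a request
 * for 2^order page frames, halving it as described above. */
static struct split_result split_block(unsigned int curr_order,
                                       unsigned int order)
{
    struct split_result r;
    unsigned long offset = 0;
    unsigned long size = 1UL << curr_order;

    assert(order <= curr_order && curr_order < MAX_ORDER);
    for (int k = 0; k < MAX_ORDER; k++)
        r.free_offset[k] = -1;

    while (curr_order > order) {
        curr_order--;
        size >>= 1;
        r.free_offset[curr_order] = (long) offset; /* first half stays free */
        offset += size;                            /* go on with second half */
    }
    r.allocated_offset = offset;
    return r;
}
```

Calling split_block(9, 7) reproduces the earlier worked example: the first 256 page frames of the 512-page block go to the 256-page list, the next 128 go to the 128-page list, and the last 128 (offset 384) satisfy the request.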

7.1.7.3 Freeing a block

The __free_pages_ok( ) function implements the buddy system strategy for freeing page frames. It uses two input parameters:

page
    The address of the descriptor of the first page frame included in the block to be released

order
    The logarithmic size of the block

The __free_pages_ok( ) function usually inserts the block of page frames in the buddy system data structures so they can be used in subsequent allocation requests. One case is an exception: if the current process is moving pages across memory zones to rebalance them, the function does not free the page frames, but inserts the block in a special list of the process. To decide what to do with the block of page frames, __free_pages_ok( ) checks the PF_FREE_PAGES flag of the process. It is set only if the process is rebalancing the memory zones. Anyway, we discuss this special case in Chapter 16; we assume here that the PF_FREE_PAGES flag of current is not set, so __free_pages_ok( ) inserts the block in the buddy system data structures.

The function starts by declaring and initializing a few local variables:

    unsigned long flags;
    unsigned long mask = (~0UL) << order;
    zone_t * zone = page->zone;
    struct page * base = zone->zone_mem_map;
    free_area_t * area = &zone->free_area[order];
    unsigned long page_idx = page - base;
    unsigned long index = page_idx >> (1 + order);
    struct page * buddy;

The page_idx local variable contains the index of the first page frame in the block with respect to the first page frame of the zone. The index local variable contains the bit number corresponding to the block in the bitmap.

The function clears the PG_referenced and PG_dirty flags of the first page frame, then it acquires the zone spin lock and disables interrupts:

    page->flags &= ~((1<<PG_referenced) | (1<<PG_dirty));

[...]

    entry = cachep->slabs_partial.next;
    if (entry == & cachep->slabs_partial) {
        entry = cachep->slabs_free.next;
        if (entry == & cachep->slabs_free)
            goto alloc_new_slab;
        list_del(entry);
        list_add(entry, & cachep->slabs_partial);
    }
    slabp = list_entry(entry, slab_t, list);

The function looks into the slabs_partial doubly linked list, which links the descriptors of the slabs that have at least one free object. If the list is empty (the list head points to itself), the function moves the first slab descriptor in the slabs_free list into slabs_partial. If even slabs_free is empty, the function allocates a new slab to the cache by invoking kmem_cache_grow( ), and then the whole procedure is repeated.

After obtaining a slab with a free object, the function increments the counter containing the number of objects currently allocated in the slab:

    slabp->inuse++;

It then loads the objp local variable with the address of the first free object inside the slab:

    objp = & slabp->s_mem[slabp->free * cachep->objsize];

The kmem_cache_alloc( ) function then updates the slabp->free field of the slab descriptor to point to the next free object:

    slabp->free = ((kmem_bufctl_t *)(slabp+1))[slabp->free];
    if (slabp->free == BUFCTL_END) {
        /* no free object left: the slab is now full */
        list_del(&slabp->list);
        list_add(&slabp->list, &cachep->slabs_full);
    }

Recall that the object descriptor of a free object stores either the index of another free object in the slab, or BUFCTL_END if the free object is the last one. The array of object descriptors is placed right after the slab descriptor, so its address is computed as (slabp+1).
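The chain of free objects threaded through the object descriptors can be modeled with an ordinary array of indexes. The following user-space sketch is only an illustration: struct model_slab, model_slab_init( ), model_slab_alloc( ), the end marker MODEL_BUFCTL_END, and the sizes are all hypothetical, and the descriptor array is a plain structure member rather than memory placed right after the slab descriptor.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_BUFCTL_END 0xffffffffU  /* hypothetical end-of-chain marker */
#define MODEL_OBJS       4            /* objects per slab in this model */
#define MODEL_OBJSIZE    16           /* bytes per object in this model */

struct model_slab {
    unsigned int inuse;                 /* objects currently allocated */
    unsigned int free;                  /* index of the first free object */
    unsigned int bufctl[MODEL_OBJS];    /* object descriptors */
    unsigned char mem[MODEL_OBJS * MODEL_OBJSIZE];
};

/* Chain all objects: descriptor i holds the index of the next free one. */
static void model_slab_init(struct model_slab *s)
{
    s->inuse = 0;
    s->free = 0;
    for (unsigned int i = 0; i < MODEL_OBJS; i++)
        s->bufctl[i] = (i + 1 < MODEL_OBJS) ? i + 1 : MODEL_BUFCTL_END;
}

/* Take the first free object and advance the free index along the chain.
 * Returns NULL when the slab is full. */
static void *model_slab_alloc(struct model_slab *s)
{
    void *objp;

    if (s->free == MODEL_BUFCTL_END)
        return NULL;
    s->inuse++;
    objp = &s->mem[s->free * MODEL_OBJSIZE];
    s->free = s->bufctl[s->free];
    return objp;
}
```

When the free index becomes MODEL_BUFCTL_END, the model slab is full; this is the point at which the code above moves the slab descriptor into the slabs_full list.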

The function terminates by re-enabling the local interrupts and returning the address of the new object:

    local_irq_restore(save_flags);
    return objp;

7.2.12.2 The multiprocessor case

kmem_cache_alloc( ) first disables the local interrupts; it then looks for a free object in the cache's local array associated with the running CPU:

    void * objp;
    cpucache_t * cc;
    local_irq_save(save_flags);
    cc = cachep->cpudata[smp_processor_id( )];
    if (cc->avail)
        /* the local array of object pointers follows the descriptor */
        objp = ((void **)(cc+1))[--cc->avail];
    else {
        objp = kmem_cache_alloc_batch(cachep, flags);
        if (!objp)
            goto alloc_new_slab;
    }
    local_irq_restore(save_flags);
    return objp;

The cc local variable contains the address of the local array descriptor; thus, (cc+1) yields the address of the first local array element. If the avail field of the local array descriptor is positive, the function loads the address of the corresponding object into the objp local variable and decrements the counter. Otherwise, it invokes kmem_cache_alloc_batch( ) to repopulate the local array.

The kmem_cache_alloc_batch( ) function gets the cache spin lock and then allocates a predefined number of objects from the cache and inserts them into the local array:

    count = cachep->batchcount;
    cc = cachep->cpudata[smp_processor_id()];
    spin_lock(&cachep->spinlock);
    while (count--) {
        entry = cachep->slabs_partial.next;
        if (entry == &cachep->slabs_partial) {
            /* No partial slab: fall back to a completely free one. */
            entry = cachep->slabs_free.next;
            if (entry == &cachep->slabs_free)
                break;
            list_del(entry);
            list_add(entry, &cachep->slabs_partial);
        }
        slabp = list_entry(entry, slab_t, list);
        slabp->inuse++;
        objp = & slabp->s_mem[slabp->free * cachep->objsize];
        slabp->free = ((kmem_bufctl_t *)(((slab_t *)slabp)+1))[slabp->free];
        if (slabp->free == BUFCTL_END) {
            list_del(&slabp->list);
            list_add(&slabp->list, &cachep->slabs_full);
        }
        ((void **)(cc+1))[cc->avail++] = objp;
    }
    spin_unlock(&cachep->spinlock);
    if (cc->avail)
        return ((void **)(cc+1))[--cc->avail];
    return NULL;

The number of preallocated objects is stored in the batchcount field of the cache descriptor; by default, it is half of the local array size, but the system administrator can modify it by writing into the /proc/slabinfo file. The code that gets the objects from the slabs is identical to that of the uniprocessor case, so we won't discuss it further. The kmem_cache_alloc_batch( ) function returns NULL if the cache does not have free objects. In this case, kmem_cache_alloc( ) invokes kmem_cache_grow( ) and then repeats the whole procedure, as in the uniprocessor case.
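The interplay between the per-CPU local array and the batch refill can be sketched in user space. This is an illustration only: the toy_ names, the array capacity, and the shared pool that stands in for the cache's slab lists are all assumptions, and no locking or interrupt masking is modeled.

```c
#include <stddef.h>

#define TOY_LIMIT     4                  /* capacity of the local array */
#define TOY_BATCH     (TOY_LIMIT / 2)    /* objects moved per refill */
#define TOY_POOL_OBJS 8                  /* objects in the shared pool */

/* Toy local array descriptor; entries[] plays the role of the
 * pointers that the kernel stores right after it, at (cc+1). */
struct toy_cpucache {
    unsigned int avail;                  /* number of valid entries */
    void *entries[TOY_LIMIT];
};

/* Toy shared pool standing in for the cache's slab lists. */
struct toy_pool {
    int objs[TOY_POOL_OBJS];
    unsigned int next;                   /* first object not yet handed out */
};

/* Move up to TOY_BATCH objects from the pool into the local array.
 * In the kernel, this step runs with the cache spin lock held. */
void toy_refill(struct toy_cpucache *cc, struct toy_pool *pool)
{
    unsigned int count = TOY_BATCH;

    while (count-- && pool->next < TOY_POOL_OBJS)
        cc->entries[cc->avail++] = &pool->objs[pool->next++];
}

/* Fast path: pop from the local array. Slow path: refill, then pop. */
void *toy_alloc(struct toy_cpucache *cc, struct toy_pool *pool)
{
    if (!cc->avail)
        toy_refill(cc, pool);
    if (cc->avail)
        return cc->entries[--cc->avail];
    return NULL;                         /* pool exhausted */
}
```

Only the refill touches shared state, so most allocations complete without taking the lock; that is the point of keeping a local array per CPU.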

7.2.13 Releasing an Object from a Cache

The kmem_cache_free( ) function releases an object previously obtained by the slab allocator. Its parameters are cachep (the address of the cache descriptor) and objp (the address of the object to be released). As with kmem_cache_alloc( ), we discuss the uniprocessor case separately from the multiprocessor case.

7.2.13.1 The uniprocessor case

The function starts by disabling the local interrupts and then determines the address of the descriptor of the slab containing the object. It uses the list.prev subfield of the descriptor of the page frame storing the object:

    slab_t * slabp;
    unsigned int objnr;
    local_irq_save(save_flags);
    slabp = (slab_t *) mem_map[__pa(objp) >> PAGE_SHIFT].list.prev;

Then the function computes the index of the object inside its slab, derives the address of its object descriptor, and adds the object to the head of the slab's list of free objects:

    objnr = (objp - slabp->s_mem) / cachep->objsize;
    ((kmem_bufctl_t *)(slabp+1))[objnr] = slabp->free;
    slabp->free = objnr;

Finally, the function checks whether the slab has to be moved to another list:

    if (--slabp->inuse == 0) {
        /* slab is now fully free */
        list_del(&slabp->list);
        list_add(&slabp->list, &cachep->slabs_free);
    } else if (slabp->inuse+1 == cachep->num) {
        /* slab was full */
        list_del(&slabp->list);
        list_add(&slabp->list, &cachep->slabs_partial);
    }
    local_irq_restore(save_flags);
    return;

7.2.13.2 The multiprocessor case

The function starts by disabling the local interrupts; then it checks whether there is a free slot in the local array of object pointers:

    cpucache_t * cc;
    local_irq_save(save_flags);
    cc = cachep->cpudata[smp_processor_id()];
    if (cc->avail == cc->limit) {
        /* Local array is full: return a batch of objects to the slabs. */
        cc->avail -= cachep->batchcount;
        free_block(cachep, &((void **)(cc+1))[cc->avail], cachep->batchcount);
    }
    ((void **)(cc+1))[cc->avail++] = objp;
    local_irq_restore(save_flags);
    return;

If there is at least one free slot in the local array, the function just sets it to the address of the object being freed. Otherwise, the function first invokes free_block( ) to release a bunch of cachep->batchcount objects to the slab allocator cache, and then stores the address in one of the slots freed in this way. The free_block(cachep,objpp,len) function acquires the cache spin lock and then releases len objects starting from the local array entry at address objpp:

    spin_lock(&cachep->spinlock);
    for ( ; len > 0; len--, objpp++) {
        slab_t * slabp = (slab_t *) mem_map[__pa(*objpp) >> PAGE_SHIFT].list.prev;
        unsigned int objnr = (*objpp - slabp->s_mem) / cachep->objsize;
        ((kmem_bufctl_t *)(slabp+1))[objnr] = slabp->free;
        slabp->free = objnr;
        if (--slabp->inuse == 0) {
            /* slab is now fully free */
            list_del(&slabp->list);
            list_add(&slabp->list, &cachep->slabs_free);
        } else if (slabp->inuse+1 == cachep->num) {
            /* slab was full */
            list_del(&slabp->list);
            list_add(&slabp->list, &cachep->slabs_partial);
        }
    }
    spin_unlock(&cachep->spinlock);

The code that releases the objects to the slabs is identical to that of the uniprocessor case, so we don't discuss it further.
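The release path for a single object (derive the object index from its address, push it on the free list, and move the slab only when a boundary is crossed) can be sketched as follows. The toy_ names are hypothetical, and the slab's list membership is reduced to a state field instead of real list_del( )/list_add( ) calls.

```c
#include <stddef.h>

#define TOY_NUM_OBJS 4
#define TOY_OBJ_SIZE 32

enum toy_state { TOY_FREE, TOY_PARTIAL, TOY_FULL };

struct toy_slab {
    unsigned int inuse;                  /* objects currently allocated */
    unsigned int free;                   /* head of the free-index list */
    unsigned int bufctl[TOY_NUM_OBJS];   /* index links between free objects */
    enum toy_state state;                /* which cache list the slab is on */
    unsigned char mem[TOY_NUM_OBJS * TOY_OBJ_SIZE];
};

/* Release objp, which must point to an allocated object of slab s. */
void toy_slab_free(struct toy_slab *s, void *objp)
{
    /* Index of the object inside its slab. */
    unsigned int objnr =
        (unsigned int)(((unsigned char *)objp - s->mem) / TOY_OBJ_SIZE);

    /* Push the object on the head of the free list. */
    s->bufctl[objnr] = s->free;
    s->free = objnr;

    if (--s->inuse == 0)
        s->state = TOY_FREE;             /* slab is now fully free */
    else if (s->inuse + 1 == TOY_NUM_OBJS)
        s->state = TOY_PARTIAL;          /* slab was full */
}
```

Note that a slab changes list only on the first release from a full slab or on the last release overall; every other release leaves it where it is.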

7.2.14 General Purpose Objects

As stated earlier in Section 7.1.7, infrequent requests for memory areas are handled through a group of general caches whose objects have geometrically distributed sizes ranging from a minimum of 32 to a maximum of 131,072 bytes. Objects of this type are obtained by invoking the kmalloc( ) function:

    void * kmalloc(size_t size, int flags)
    {
        cache_sizes_t *csizep = cache_sizes;
        kmem_cache_t * cachep;
        for (; csizep->cs_size; csizep++) {
            /* Skip the caches whose objects are too small. */
            if (size > csizep->cs_size)
                continue;
            if (flags & __GFP_DMA)
                cachep = csizep->cs_dmacachep;
            else
                cachep = csizep->cs_cachep;
            return __kmem_cache_alloc(cachep, flags);
        }
        return NULL;
    }

The function uses the cache_sizes table to locate the smallest power-of-2 size that is not less than the requested size. It then calls kmem_cache_alloc( ) to allocate the object, passing to it either the cache descriptor for the page frames usable for ISA DMA or the cache descriptor for the "normal" page frames, depending on whether the caller specified the __GFP_DMA flag.[3]

[3] Actually, for more efficiency, the code of kmem_cache_alloc( ) is copied inside the body of kmalloc( ). The __kmem_cache_alloc( ) function, which implements kmem_cache_alloc( ), is declared inline.
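The size-class lookup performed by kmalloc( ) can be reproduced in a few lines. The sketch below assumes a table holding every power of 2 from 32 to 131,072, terminated by a zero entry as the loop condition in kmalloc( ) suggests; toy_sizes and toy_size_class are hypothetical names.

```c
#include <stddef.h>

/* Toy size table: powers of 2 from 32 to 131,072, zero-terminated. */
static const size_t toy_sizes[] = {
    32, 64, 128, 256, 512, 1024, 2048, 4096, 8192,
    16384, 32768, 65536, 131072, 0
};

/* Return the size of the first class that fits, or 0 if none does. */
size_t toy_size_class(size_t size)
{
    const size_t *p;

    for (p = toy_sizes; *p; p++)
        if (size <= *p)
            return *p;
    return 0;
}
```

A request for 33 bytes therefore consumes a 64-byte object: rounding up to a power of 2 trades internal fragmentation for a small, fixed set of caches.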

Objects obtained by invoking kmalloc( ) can be released by calling kfree( ):

    void kfree(const void *objp)
    {
        kmem_cache_t * c;
        unsigned long flags;
        if (!objp)
            return;
        local_irq_save(flags);
        c = (kmem_cache_t *) mem_map[__pa(objp) >> PAGE_SHIFT].list.next;
        __kmem_cache_free(c, (void *) objp);
        local_irq_restore(flags);
    }

The proper cache descriptor is identified by reading the list.next subfield of the descriptor of the first page frame containing the memory area. The memory area is released by invoking kmem_cache_free( ).[4]

[4] As for kmalloc( ), the code of kmem_cache_free( ) is copied inside kfree( ). __kmem_cache_free( ), which implements kmem_cache_free( ), is declared inline.


7.3 Noncontiguous Memory Area Management

We already know that it is preferable to map memory areas into sets of contiguous page frames, thus making better use of the cache and achieving lower average memory access times. Nevertheless, if the requests for memory areas are infrequent, it makes sense to consider an allocation scheme based on noncontiguous page frames accessed through contiguous linear addresses. The main advantage of this scheme is to avoid external fragmentation, while the disadvantage is that it is necessary to fiddle with the kernel Page Tables. Clearly, the size of a noncontiguous memory area must be a multiple of 4,096. Linux uses noncontiguous memory areas in several ways: for instance, to allocate data structures for active swap areas (see Section 16.2.3), to allocate space for a module (see Appendix B), or to allocate buffers to some I/O drivers.

7.3.1 Linear Addresses of Noncontiguous Memory Areas

To find a free range of linear addresses, we can look in the area starting from PAGE_OFFSET (usually 0xc0000000, the beginning of the fourth gigabyte). Figure 7-7 shows how the fourth gigabyte linear addresses are used:

● The beginning of the area includes the linear addresses that map the first 896 MB of RAM (see Section 2.5.4); the linear address that corresponds to the end of the directly mapped physical memory is stored in the high_memory variable.

● The end of the area contains the fix-mapped linear addresses (see Section 2.5.6).

● Starting from PKMAP_BASE (0xfe000000), we find the linear addresses used for the persistent kernel mapping of high-memory page frames (see Section 7.1.6 earlier in this chapter).

● The remaining linear addresses can be used for noncontiguous memory areas. A safety interval of size 8 MB (macro VMALLOC_OFFSET) is inserted between the end of the physical memory mapping and the first memory area; its purpose is to "capture" out-of-bounds memory accesses. For the same reason, additional safety intervals of size 4 KB are inserted to separate noncontiguous memory areas.

Figure 7-7. The linear address interval starting from PAGE_OFFSET

The VMALLOC_START macro defines the starting address of the linear space reserved for noncontiguous memory areas, while VMALLOC_END defines its ending address.
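To make the layout concrete, the sketch below derives a starting address for the noncontiguous areas from the end of the directly mapped memory. The rounding rule (skip at least 8 MB, then align to an 8 MB boundary) is an assumption chosen for illustration; the text above states only that an 8 MB safety interval separates the two ranges. The toy_ names are hypothetical.

```c
#define TOY_VMALLOC_OFFSET (8UL * 1024 * 1024)   /* 8 MB safety interval */

/* Leave a gap of at least TOY_VMALLOC_OFFSET bytes after high_memory
 * and round the result up to a multiple of TOY_VMALLOC_OFFSET. */
unsigned long toy_vmalloc_start(unsigned long high_memory)
{
    return (high_memory + 2 * TOY_VMALLOC_OFFSET - 1)
           & ~(TOY_VMALLOC_OFFSET - 1);
}
```

With 896 MB of directly mapped RAM, high_memory is 0xc0000000 + 0x38000000 = 0xf8000000, and the function yields 0xf8800000, exactly 8 MB further on. When high_memory is not aligned, the gap grows, but it never falls below 8 MB.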

7.3.2 Descriptors of Noncontiguous Memory Areas

Each noncontiguous memory area is associated with a descriptor of type struct vm_struct:

    struct vm_struct {
        unsigned long flags;
        void * addr;
        unsigned long size;
        struct vm_struct * next;
    };

These descriptors are inserted in a simple list by means of the next field; the address of the first element of the list is stored in the vmlist variable. Accesses to this list are protected by means of the vmlist_lock read/write spin lock. The addr field contains the linear address of the first memory cell of the area; the size field contains the size of the area plus 4,096 (which is the size of the previously mentioned inter-area safety interval).

The get_vm_area( ) function creates new descriptors of type struct vm_struct; its parameter size specifies the size of the new memory area. The function is essentially equivalent to the following:

    struct vm_struct * get_vm_area(unsigned long size, unsigned long flags)
    {
        unsigned long addr;
        struct vm_struct **p, *tmp, *area;
        area = (struct vm_struct *) kmalloc(sizeof(*area), GFP_KERNEL);
        if (!area)
            return NULL;
        size += PAGE_SIZE;      /* add the 4 KB safety interval */
        addr = VMALLOC_START;
        write_lock(&vmlist_lock);
        for (p = &vmlist; (tmp = *p) ; p = &tmp->next) {
            if (size + addr <= (unsigned long) tmp->addr) {
                /* The gap before tmp is large enough: insert here. */
                area->flags = flags;
                area->addr = (void *) addr;
                area->size = size;
                area->next = *p;
                *p = area;
                write_unlock(&vmlist_lock);
                return area;
            }
            addr = tmp->size + (unsigned long) tmp->addr;
            if (addr + size > VMALLOC_END) {
                write_unlock(&vmlist_lock);
                kfree(area);
                return NULL;
            }
        }
    }

The function first calls kmalloc( ) to obtain a memory area for the new descriptor. It then scans the list of descriptors of type struct vm_struct looking for an available range of linear addresses that includes at least size+4096 addresses. If such an interval exists, the function initializes the fields of the descriptor and terminates by returning the address of the descriptor, whose addr field holds the initial address of the noncontiguous memory area. Otherwise, when addr + size exceeds VMALLOC_END, get_vm_area( ) releases the descriptor and returns NULL.

7.3.3 Allocating a Noncontiguous Memory Area

The vmalloc( ) function allocates a noncontiguous memory area to the kernel. The parameter size denotes the size of the requested area. If the function is able to satisfy the request, it then returns the initial linear address of the new area; otherwise, it returns a NULL pointer:

    void * vmalloc(unsigned long size)
    {
        void * addr;
        struct vm_struct *area;
        size = (size + PAGE_SIZE - 1) & PAGE_MASK;
        area = get_vm_area(size, VM_ALLOC);
        if (!area)
            return NULL;
        addr = area->addr;
        if (vmalloc_area_pages((unsigned long) addr, size,
                               GFP_KERNEL|__GFP_HIGHMEM, 0x63)) {
            vfree(addr);
            return NULL;
        }
        return addr;
    }

The function starts by rounding up the value of the size parameter to a multiple of 4,096 (the page frame size). Then vmalloc( ) invokes get_vm_area( ), which creates a new descriptor recording the linear addresses assigned to the memory area. The flags field of the descriptor is initialized with the VM_ALLOC flag, which means that the linear address range is going to be used for a noncontiguous memory allocation (we'll see in Chapter 13 that vm_struct descriptors are also used to remap memory on hardware devices). Then the vmalloc( ) function invokes vmalloc_area_pages( ) to request noncontiguous page frames and terminates by returning the initial linear address of the noncontiguous memory area.

The vmalloc_area_pages( ) function uses four parameters:

address
    The initial linear address of the area.

size
    The size of the area.

gfp_mask
    The allocation flags passed to the buddy system allocator function. It is always set to GFP_KERNEL|__GFP_HIGHMEM.

prot
    The protection bits of the allocated page frames. It is always set to 0x63, which corresponds to Present, Accessed, Read/Write, and Dirty.
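The value 0x63 can be checked against the flag bits of an 80x86 Page Table entry. The bit positions below are those defined by the architecture; the TOY_ macro names and the helper function are hypothetical.

```c
/* Flag bits in the low byte of an 80x86 Page Table entry. */
#define TOY_PAGE_PRESENT  0x001
#define TOY_PAGE_RW       0x002
#define TOY_PAGE_USER     0x004
#define TOY_PAGE_ACCESSED 0x020
#define TOY_PAGE_DIRTY    0x040

/* Protection bits used for the page frames of a noncontiguous area. */
unsigned int toy_kernel_prot(void)
{
    return TOY_PAGE_PRESENT | TOY_PAGE_RW |
           TOY_PAGE_ACCESSED | TOY_PAGE_DIRTY;
}
```

The User/Supervisor bit is left clear, so the pages are reachable only in Kernel Mode.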

The function starts by assigning the linear address of the end of the area to the end local variable:

    end = address + size;

The function then uses the pgd_offset_k macro to derive the entry in the master kernel Page Global Directory related to the initial linear address of the area; it then acquires the kernel Page Table spin lock:

    dir = pgd_offset_k(address);
    spin_lock(&init_mm.page_table_lock);

The function then executes the following cycle:

    while (address < end) {
        pmd_t *pmd = pmd_alloc(&init_mm, dir, address);
        ret = -ENOMEM;
        if (!pmd)
            break;
        if (alloc_area_pmd(pmd, address, end - address, gfp_mask, prot))
            break;
        /* Move to the start of the next Page Global Directory entry. */
        address = (address + PGDIR_SIZE) & PGDIR_MASK;
        dir++;
        ret = 0;
    }
    spin_unlock(&init_mm.page_table_lock);
    return ret;

In each cycle, it first invokes pmd_alloc( ) to create a Page Middle Directory for the new area and writes its physical address in the right entry of the kernel Page Global Directory. It then calls alloc_area_pmd( ) to allocate all the Page Tables associated with the new Page Middle Directory. It advances address to the next multiple of 2^22 (the size of the range of linear addresses spanned by a single Page Middle Directory), and it increases the pointer dir to the Page Global Directory.
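The address arithmetic of the cycle is easy to get wrong, because the first step may be shorter than 2^22 bytes when the area does not start on a boundary. The sketch below counts how many Page Global Directory entries a range touches, using the same update expression. It assumes the 2^22 span mentioned above; the TOY_ names are hypothetical.

```c
#define TOY_PGDIR_SHIFT 22
#define TOY_PGDIR_SIZE  (1UL << TOY_PGDIR_SHIFT)     /* 4 MB */
#define TOY_PGDIR_MASK  (~(TOY_PGDIR_SIZE - 1))

/* Count the Page Global Directory entries touched by [address, end). */
unsigned int toy_count_pgd_entries(unsigned long address, unsigned long end)
{
    unsigned int n = 0;

    while (address < end) {
        n++;
        /* Jump to the start of the next 4 MB-aligned range. */
        address = (address + TOY_PGDIR_SIZE) & TOY_PGDIR_MASK;
    }
    return n;
}
```

An 8 KB area that straddles a 4 MB boundary therefore needs two iterations, even though it is far smaller than the span of a single entry.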

The cycle is repeated until all Page Table entries referring to the noncontiguous memory area are set up. The alloc_area_pmd( ) function executes a similar cycle for all the Page Tables that a Page Middle Directory points to:

    while (address < end) {
        pte_t * pte = pte_alloc(&init_mm, pmd, address);
        if (!pte)
            return -ENOMEM;
        if (alloc_area_pte(pte, address, end - address))
            return -ENOMEM;
        address = (address + PMD_SIZE) & PMD_MASK;
        pmd++;
    }

The pte_alloc( ) function (see Section 2.5.2) allocates a new Page Table and updates the corresponding entry in the Page Middle Directory. Next, alloc_area_pte( ) allocates all the page frames corresponding to the entries in the Page Table. The value of address is advanced to the next multiple of 2^22 (the size of the linear address interval spanned by a single Page Table) and the cycle is repeated. The main cycle of alloc_area_pte( ) is:

    while (address < end) {
        struct page * page;
        /* Release the lock while allocating: the allocator may block. */
        spin_unlock(&init_mm.page_table_lock);
        page = alloc_page(gfp_mask);
        spin_lock(&init_mm.page_table_lock);
        if (!page)
            return -ENOMEM;
        set_pte(pte, mk_pte(page, prot));
        address += PAGE_SIZE;
        pte++;
    }

Each page frame is allocated through alloc_page( ). The physical address of the new page frame is written into the Page Table by the set_pte and mk_pte macros. The cycle is repeated after adding the constant 4,096 (the length of a page frame) to address.

Notice that the Page Tables of the current process are not touched by vmalloc_area_pages( ). Therefore, when a process in Kernel Mode accesses the noncontiguous memory area, a Page Fault occurs, since the entries in the process's Page Tables corresponding to the area are null. However, the Page Fault handler checks the faulty linear address against the master kernel Page Tables (which are the init_mm.pgd Page Global Directory and its child Page Tables; see Section 2.5.5). Once the handler discovers that a master kernel Page Table includes a non-null entry for the address, it copies its value into the corresponding process's Page Table entry and resumes normal execution of the process. This mechanism is described in Section 8.4.
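The lazy synchronization between the master kernel Page Tables and those of a process can be modeled with two arrays. This is a deliberately simplified sketch: it works at the level of Page Global Directory entries only, treats 0 as the null entry, and uses hypothetical toy_ names; the real handler is described in Section 8.4.

```c
#define TOY_PGD_ENTRIES 1024
#define TOY_PGDIR_SHIFT 22

struct toy_pgd {
    unsigned long entry[TOY_PGD_ENTRIES];    /* 0 means "null entry" */
};

/* Handle a Kernel Mode fault on a noncontiguous memory area address.
 * Returns 0 if the process entry was fixed up from the master copy,
 * or -1 if the master copy has no mapping (a genuine bug). */
int toy_vmalloc_fault(struct toy_pgd *proc, const struct toy_pgd *master,
                      unsigned long address)
{
    unsigned int idx =
        (unsigned int)((address >> TOY_PGDIR_SHIFT) & (TOY_PGD_ENTRIES - 1));

    if (master->entry[idx] == 0)
        return -1;
    proc->entry[idx] = master->entry[idx];   /* copy, then retry the access */
    return 0;
}
```

The design choice this illustrates is that vmalloc( ) never has to walk every process's Page Tables: each process pays for the update only if and when it touches the area.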

7.3.4 Releasing a Noncontiguous Memory Area

The vfree( ) function releases noncontiguous memory areas. Its parameter addr contains the initial linear address of the area to be released. vfree( ) first scans the list pointed to by vmlist to find the address of the area descriptor associated with the area to be released:

    write_lock(&vmlist_lock);
    for (p = &vmlist ; (tmp = *p) ; p = &tmp->next) {
        if (tmp->addr == addr) {
            /* Unlink the descriptor, then release pages and descriptor. */
            *p = tmp->next;
            vmfree_area_pages((unsigned long)(tmp->addr), tmp->size);
            write_unlock(&vmlist_lock);
            kfree(tmp);
            return;
        }
    }
    write_unlock(&vmlist_lock);
    printk("Trying to vfree( ) nonexistent vm area (%p)\n", addr);

The size field of the descriptor specifies the size of the area to be released. The area itself is released by invoking vmfree_area_pages( ), while the descriptor is released by invoking kfree( ).
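Both get_vm_area( ) and vfree( ) walk the descriptor list through a pointer to the previous link, which lets them insert or unlink an element without special-casing the list head. The user-space sketch below shows the first-fit insertion and the removal under simplifying assumptions: the toy_ names and the small address range are hypothetical, malloc( ) and free( ) replace kmalloc( ) and kfree( ), and no locking is modeled.

```c
#include <stdlib.h>

#define TOY_START 0x1000UL      /* start of the toy address range */
#define TOY_END   0x9000UL      /* end of the toy address range */
#define TOY_GUARD 0x1000UL      /* safety interval appended to each area */

struct toy_area {
    unsigned long addr, size;   /* size includes the guard */
    struct toy_area *next;
};

struct toy_area *toy_list;      /* kept sorted by address */

/* First fit: insert a new area in the first gap that is large enough. */
struct toy_area *toy_get_area(unsigned long size)
{
    unsigned long addr = TOY_START;
    struct toy_area **p, *tmp, *area;

    size += TOY_GUARD;
    for (p = &toy_list; (tmp = *p); p = &tmp->next) {
        if (addr + size <= tmp->addr)
            break;                      /* the gap before tmp fits */
        addr = tmp->addr + tmp->size;
    }
    if (addr + size > TOY_END)
        return NULL;
    area = malloc(sizeof(*area));
    if (!area)
        return NULL;
    area->addr = addr;
    area->size = size;
    area->next = *p;                    /* link in front of tmp (or at tail) */
    *p = area;
    return area;
}

/* Unlink the area starting at addr; return 0 on success, -1 if absent. */
int toy_free_area(unsigned long addr)
{
    struct toy_area **p, *tmp;

    for (p = &toy_list; (tmp = *p); p = &tmp->next) {
        if (tmp->addr == addr) {
            *p = tmp->next;
            free(tmp);
            return 0;
        }
    }
    return -1;
}
```

Unlike the listing of get_vm_area( ) shown earlier, this sketch also handles an empty list and a scan that reaches the tail, which is a choice made here for self-containment.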

The vmfree_area_pages( ) function takes two parameters: the initial linear address and the size of the area. It executes the following cycle to reverse the actions performed by vmalloc_area_pages( ):

    dir = pgd_offset_k(address);
    while (address < end) {
        free_area_pmd(dir, address, end - address);
        address = (address + PGDIR_SIZE) & PGDIR_MASK;
        dir++;
    }

In turn, free_area_pmd( ) reverses the actions of alloc_area_pmd( ) in the cycle:

    while (address < end) {
        free_area_pte(pmd, address, end - address);
        address = (address + PMD_SIZE) & PMD_MASK;
        pmd++;
    }

Again, free_area_pte( ) reverses the activity of alloc_area_pte( ) in the cycle:

    while (address < end) {
        pte_t page = *pte;
        pte_clear(pte);
        address += PAGE_SIZE;
        pte++;
        if (pte_none(page))
            continue;
        if (pte_present(page)) {
            __free_page(pte_page(page));
            continue;
        }
        printk("Whee... Swapped out page in kernel page table\n");
    }

Each page frame assigned to the noncontiguous memory area is released by means of the buddy system __free_page( ) function. The corresponding entry in the Page Table is set to 0 by the pte_clear macro.

As for vmalloc( ), the kernel modifies the entries of the master kernel Page Global Directory and its child Page Tables (see Section 2.5.5), but it leaves unchanged the entries of the process Page Tables mapping the fourth gigabyte. This is fine because the kernel never reclaims Page Middle Directories and Page Tables rooted at the master kernel Page Global Directory.

For instance, suppose that a process in Kernel Mode accessed a noncontiguous memory area that later got released. The process's Page Global Directory entries are equal to the corresponding entries of the master kernel Page Global Directory, thanks to the mechanism explained in Section 8.4; they point to the same Page Middle Directories and Page Tables. The vmfree_area_pages( ) function clears only the entries of the Page Tables (without reclaiming the Page Tables themselves). Further accesses of the process to the released noncontiguous memory area will trigger Page Faults because of the null Page Table entries. However, the handler will consider such accesses a bug because the master kernel Page Tables do not include valid entries.


Chapter 8. Process Address Space

As seen in the previous chapter, a kernel function gets dynamic memory in a fairly straightforward manner by invoking one of a variety of functions: __get_free_pages( ) or alloc_pages( ) to get pages from the buddy system algorithm, kmem_cache_alloc( ) or kmalloc( ) to use the slab allocator for specialized or general-purpose objects, and vmalloc( ) to get a noncontiguous memory area. If the request can be satisfied, each of these functions returns a page descriptor address or a linear address identifying the beginning of the allocated dynamic memory area.

These simple approaches work for two reasons:

● The kernel is the highest-priority component of the operating system. If a kernel function makes a request for dynamic memory, it must have a valid reason to issue that request, and there is no point in trying to defer it.

● The kernel trusts itself. All kernel functions are assumed to be error-free, so the kernel does not need to insert any protection against programming errors.

When allocating memory to User Mode processes, the situation is entirely different:

● Process requests for dynamic memory are considered nonurgent. When a process's executable file is loaded, for instance, it is unlikely that the process will address all the pages of code in the near future. Similarly, when a process invokes malloc( ) to get additional dynamic memory, it doesn't mean the process will soon access all the additional memory obtained. Thus, as a general rule, the kernel tries to defer allocating dynamic memory to User Mode processes.

● Since user programs cannot be trusted, the kernel must be prepared to catch all addressing errors caused by processes in User Mode.

As this chapter describes, the kernel succeeds in deferring the allocation of dynamic memory to processes by using a new kind of resource. When a User Mode process asks for dynamic memory, it doesn't get additional page frames; instead, it gets the right to use a new range of linear addresses, which become part of its address space. This interval is called a memory region.

In the next section, we discuss how the process views dynamic memory. We then describe the basic components of the process address space in Section 8.3. Next, we examine in detail the role played by the Page Fault exception handler in deferring the allocation of page frames to processes and illustrate how the kernel creates and deletes whole process address spaces. Last, we discuss the APIs and system calls related to address space management.


8.1 The Process's Address Space

The address space of a process consists of all linear addresses that the process is allowed to use. Each process sees a different set of linear addresses; the address used by one process bears no relation to the address used by another. As we shall see later, the kernel may dynamically modify a process address space by adding or removing intervals of linear addresses.

The kernel represents intervals of linear addresses by means of resources called memory regions, which are characterized by an initial linear address, a length, and some access rights. For reasons of efficiency, both the initial address and the length of a memory region must be multiples of 4,096 so that the data identified by each memory region completely fills up the page frames allocated to it. Following are some typical situations in which a process gets new memory regions:

● When the user types a command at the console, the shell process creates a new process to execute the command. As a result, a fresh address space, thus a set of memory regions, is assigned to the new process (see Section 8.5 later in this chapter; also, see Chapter 20).

● A running process may decide to load an entirely different program. In this case, the process ID remains unchanged but the memory regions used before loading the program are released and a new set of memory regions is assigned to the process (see Section 20.4).

● A running process may perform a "memory mapping" on a file (or on a portion of it). In such cases, the kernel assigns a new memory region to the process to map the file (see Chapter 15).

● A process may keep adding data on its User Mode stack until all addresses in the memory region that map the stack have been used. In this case, the kernel may decide to expand the size of that memory region (see Section 8.4 later in this chapter).

● A process may create an IPC-shared memory region to share data with other cooperating processes. In this case, the kernel assigns a new memory region to the process to implement this construct (see Section 19.3.5).

● A process may expand its dynamic area (the heap) through a function such as malloc( ). As a result, the kernel may decide to expand the size of the memory region assigned to the heap (see Section 8.6 later in this chapter).

Ta b le 8 - 1 illu s t ra t e s s o m e o f t h e s ys t e m ca lls re la t e d t o t h e p re vio u s ly m e n t io n e d t a s ks . Wit h t h e e xce p t io n o f brk( ), wh ich is d is cu s s e d a t t h e e n d o f t h is ch a p t e r, t h e s ys t e m ca lls a re d e s crib e d in o t h e r ch a p t e rs .

Table 8-1. System calls related to memory region creation and deletion

System call      Description
brk( ), sbrk( )  Changes the heap size of the process
execve( )        Loads a new executable file, thus changing the process address space
_exit( )         Terminates the current process and destroys its address space
fork( )          Creates a new process, and thus a new address space
mmap( )          Creates a memory mapping for a file, thus enlarging the process address space
munmap( )        Destroys a memory mapping for a file, thus contracting the process address space
shmat( )         Attaches a shared memory region
shmdt( )         Detaches a shared memory region
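Two of the calls in Table 8-1 can be tried out directly from User Mode. The following is a minimal sketch, assuming a Linux system with glibc; the helper names region_create( ), region_touch( ), and region_destroy( ) are ours and are introduced only for illustration. It uses an anonymous private mapping rather than a file mapping, so no file is needed.

```c
#define _DEFAULT_SOURCE      /* exposes MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Asks the kernel for a new memory region of len bytes (anonymous,
 * private, readable and writable). Returns NULL on failure. */
void *region_create(size_t len)
{
    void *p;

    assert(len > 0);
    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Writes to every byte of the region; page frames are assigned
 * only when the pages are touched for the first time. */
void region_touch(void *addr, size_t len, int value)
{
    memset(addr, value, len);
}

/* Removes the linear address interval from the address space.
 * Returns 0 on success, -1 on failure (as munmap( ) does). */
int region_destroy(void *addr, size_t len)
{
    return munmap(addr, len);
}
```

A caller would typically invoke region_create( ), use the memory, and then call region_destroy( ) with the same address and length. Whether the kernel creates a brand new region or enlarges an adjacent one is not visible through this interface.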

As we shall see in Section 8.4, it is essential for the kernel to identify the memory regions currently owned by a process (the address space of a process), since that allows the Page Fault exception handler to efficiently distinguish between two types of invalid linear addresses that cause it to be invoked:

● Those caused by programming errors.

● Those caused by a missing page; even though the linear address belongs to the process's address space, the page frame corresponding to that address has yet to be allocated.

The latter addresses are not invalid from the process's point of view; the kernel handles the Page Fault by providing the page frame and letting the process continue.


8.2 The Memory Descriptor

All information related to the process address space is included in a data structure called a memory descriptor. This structure, of type mm_struct, is referenced by the mm field of the process descriptor. The fields of a memory descriptor are listed in Table 8-2.

Table 8-2. The fields of the memory descriptor

Type                     Field            Description
struct vm_area_struct *  mmap             Pointer to the head of the list of memory region objects
rb_root_t                mm_rb            Pointer to the root of the red-black tree of memory region objects
struct vm_area_struct *  mmap_cache       Pointer to the last referenced memory region object
pgd_t *                  pgd              Pointer to the Page Global Directory
atomic_t                 mm_users         Secondary usage counter
atomic_t                 mm_count         Main usage counter
int                      map_count        Number of memory regions
struct rw_semaphore      mmap_sem         Memory regions' read/write semaphore
spinlock_t               page_table_lock  Memory regions' and Page Tables' spin lock
struct list_head         mmlist           Pointers to adjacent elements in the list of memory descriptors
unsigned long            start_code       Initial address of executable code
unsigned long            end_code         Final address of executable code
unsigned long            start_data       Initial address of initialized data
unsigned long            end_data         Final address of initialized data
unsigned long            start_brk        Initial address of the heap
unsigned long            brk              Current final address of the heap
unsigned long            start_stack      Initial address of User Mode stack
unsigned long            arg_start        Initial address of command-line arguments
unsigned long            arg_end          Final address of command-line arguments
unsigned long            env_start        Initial address of environment variables
unsigned long            env_end          Final address of environment variables
unsigned long            rss              Number of page frames allocated to the process
unsigned long            total_vm         Size of the process address space (number of pages)
unsigned long            locked_vm        Number of "locked" pages that cannot be swapped out (see Chapter 16)
unsigned long            def_flags        Default access flags of the memory regions
unsigned long            cpu_vm_mask      Bit mask for lazy TLB switches (see Chapter 2)
unsigned long            swap_address     Last scanned linear address for swapping (see Chapter 16)
unsigned int             dumpable         Flag that specifies whether the process can produce a core dump of the memory
mm_context_t             context          Pointer to table for architecture-specific information (e.g., LDT's address in 80x86 platforms)

All memory descriptors are stored in a doubly linked list. Each descriptor stores the address of the adjacent list items in the mmlist field. The first element of the list is the mmlist field of init_mm, the memory descriptor used by process 0 in the initialization phase. The list is protected against concurrent accesses in multiprocessor systems by the mmlist_lock spin lock. The number of memory descriptors in the list is stored in the mmlist_nr variable.

The mm_users field stores the number of lightweight processes that share the mm_struct data structure (see Section 3.4.1). The mm_count field is the main usage counter of the memory descriptor; all "users" in mm_users count as one unit in mm_count. Every time the mm_count field is decremented, the kernel checks whether it becomes zero; if so, the memory descriptor is deallocated because it is no longer in use.

We'll try to explain the difference between the use of mm_users and mm_count with an example. Consider a memory descriptor shared by two lightweight processes. Normally, its mm_users field stores the value 2, while its mm_count field stores the value 1 (both owner processes count as one). If the memory descriptor is temporarily lent to a kernel thread (see the next section), the kernel increments the mm_count field. In this way, even if both lightweight processes die and the mm_users field becomes zero, the memory descriptor is not released until the kernel thread finishes using it, because the mm_count field remains greater than zero.

If the kernel wants to be sure that the memory descriptor is not released in the middle of a lengthy operation, it might increment the mm_users field instead of mm_count (this is what the swap_out( ) function does; see Section 16.5). The final result is the same because the increment of mm_users ensures that mm_count does not become zero even if all lightweight processes that own the memory descriptor die.

The mm_alloc( ) function is invoked to get a new memory descriptor. Since these descriptors are stored in a slab allocator cache, mm_alloc( ) calls kmem_cache_alloc( ), initializes the new memory descriptor, and sets the mm_count and mm_users fields to 1.

Conversely, the mmput( ) function decrements the mm_users field of a memory descriptor. If that field becomes 0, the function releases the Local Descriptor Table, the memory region descriptors (see later in this chapter), and the Page Tables referenced by the memory descriptor, and then invokes mmdrop( ). The latter function decrements mm_count and, if it becomes zero, releases the mm_struct data structure.
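The interplay of the two counters can be mimicked in a few lines of User Mode C. This is only a sketch under simplifying assumptions: struct mm_sketch, the _sketch function names, the torn_down flag, and the mm_freed counter are hypothetical stand-ins, plain integers replace atomic_t, and malloc-style allocation replaces the slab allocator. It is not the kernel code.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for mm_struct: only the two usage counters. */
struct mm_sketch {
    int mm_users;   /* lightweight processes sharing the address space */
    int mm_count;   /* 1 for all users together, plus 1 per borrower */
    int torn_down;  /* set when regions and Page Tables would be released */
};

int mm_freed;       /* number of descriptors freed so far (demo only) */

/* Mirrors mm_alloc( ): both counters start at 1. */
struct mm_sketch *mm_alloc_sketch(void)
{
    struct mm_sketch *mm = calloc(1, sizeof(*mm));

    if (mm != NULL) {
        mm->mm_users = 1;
        mm->mm_count = 1;
    }
    return mm;
}

/* Mirrors mmdrop( ): frees the descriptor when mm_count reaches 0. */
void mmdrop_sketch(struct mm_sketch *mm)
{
    assert(mm->mm_count > 0);
    if (--mm->mm_count == 0) {
        free(mm);
        mm_freed++;
    }
}

/* Mirrors mmput( ): when the last user goes away, the address space
 * is torn down and the users' single unit in mm_count is dropped. */
void mmput_sketch(struct mm_sketch *mm)
{
    assert(mm->mm_users > 0);
    if (--mm->mm_users == 0) {
        mm->torn_down = 1;  /* LDT, regions, Page Tables in the kernel */
        mmdrop_sketch(mm);
    }
}
```

Replaying the example from the text: with two users and one kernel thread borrowing the descriptor (mm_users equal to 2, mm_count equal to 2), two calls to mmput_sketch( ) tear down the address space but leave the descriptor allocated; only the borrower's final mmdrop_sketch( ) frees it.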

The mmap, mm_rb, mmlist, and mmap_cache fields are discussed in the next section.

8.2.1 Memory Descriptor of Kernel Threads

Kernel threads run only in Kernel Mode, so they never access linear addresses below TASK_SIZE (same as PAGE_OFFSET, usually 0xc0000000). Contrary to regular processes, kernel threads do not use memory regions; therefore, most of the fields of a memory descriptor are meaningless for them. Since the Page Table entries that refer to linear addresses above TASK_SIZE should always be identical, it does not really matter what set of Page Tables a kernel thread uses.

To avoid useless TLB and cache flushes, kernel threads use the Page Tables of a regular process in Linux 2.4. To that end, two kinds of memory descriptor pointers are included in every process descriptor: mm and active_mm.

The mm field in the process descriptor points to the memory descriptor owned by the process, while the active_mm field points to the memory descriptor used by the process when it is in execution. For regular processes, the two fields store the same pointer. Kernel threads, however, do not own any memory descriptor, thus their mm field is always NULL. When a kernel thread is selected for execution, its active_mm field is initialized to the value of the active_mm of the previously running process (see Section 11.2.2.3).
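The borrowing rule can be summarized with a small sketch. The types struct lent_mm and struct task_sketch and the function switch_mm_sketch( ) are hypothetical and exist only for this illustration; the real process switch does considerably more (see Section 11.2.2.3), and the point at which the borrowed reference is given back is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced memory descriptor: just the main usage counter. */
struct lent_mm {
    int mm_count;
};

/* Hypothetical, reduced process descriptor. */
struct task_sketch {
    struct lent_mm *mm;         /* owned descriptor; NULL for kernel threads */
    struct lent_mm *active_mm;  /* descriptor in use while running */
};

/* Invoked when next replaces prev on the CPU. */
void switch_mm_sketch(struct task_sketch *prev, struct task_sketch *next)
{
    assert(prev->active_mm != NULL);

    if (next->mm == NULL) {
        /* Kernel thread: borrow the descriptor that is already active,
         * so no Page Table switch and no TLB flush are required. */
        next->active_mm = prev->active_mm;
        next->active_mm->mm_count++;    /* the descriptor is "lent" */
    } else {
        /* Regular process: it runs on its own address space. */
        next->active_mm = next->mm;
    }
}
```

Note how the increment of mm_count ties in with the usage counters described earlier: the lent descriptor cannot be freed while the kernel thread is still running on it.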

There is, however, a small complication. Whenever a process in Kernel Mode modifies a Page Table entry for a "high" linear address (above TASK_SIZE), it should also update the corresponding entry in the sets of Page Tables of all processes in the system. In fact, once set by a process in Kernel Mode, the mapping should be effective for all other processes in Kernel Mode as well. Touching the sets of Page Tables of all processes is a costly operation; therefore, Linux adopts a deferred approach.

We already mentioned this deferred approach in Section 7.3: every time a high linear address has to be remapped (typically by vmalloc( ) or vfree( )), the kernel updates a canonical set of Page Tables rooted at the swapper_pg_dir master kernel Page Global Directory (see Section 2.5.5). This Page Global Directory is pointed to by the pgd field of a master memory descriptor, which is stored in the init_mm variable. [1]

[1] We mentioned in Section 3.4.2 that the swapper kernel thread uses init_mm during the initialization phase. However, swapper never uses this memory descriptor once the initialization phase completes.

Later, in Section 8.4.5, we'll describe how the Page Fault handler takes care of spreading the information stored in the canonical Page Tables when effectively needed.

8.3 Memory Regions

Linux implements a memory region by means of an object of type vm_area_struct; its fields are shown in Table 8-3.

Table 8-3. The fields of the memory region object

Type                           Field            Description
struct mm_struct *             vm_mm            Pointer to the memory descriptor that owns the region
unsigned long                  vm_start         First linear address inside the region
unsigned long                  vm_end           First linear address after the region
struct vm_area_struct *        vm_next          Next region in the process list
pgprot_t                       vm_page_prot     Access permissions for the page frames of the region
unsigned long                  vm_flags         Flags of the region
rb_node_t                      vm_rb            Data for the red-black tree (see later in this chapter)
struct vm_area_struct *        vm_next_share    Pointer to the next element in the file memory mapping list
struct vm_area_struct **       vm_pprev_share   Pointer to previous element in the file memory mapping list
struct vm_operations_struct *  vm_ops           Pointer to the methods of the memory region
unsigned long                  vm_pgoff         Offset in mapped file, if any (see Chapter 15)
struct file *                  vm_file          Pointer to the file object of the mapped file, if any
unsigned long                  vm_raend         End of current read-ahead window of the mapped file (see Section 15.2.4)
void *                         vm_private_data  Pointer to private data of the memory region

Each memory region descriptor identifies a linear address interval. The vm_start field contains the first linear address of the interval, while the vm_end field contains the first linear address outside of the interval; vm_end - vm_start thus denotes the length of the memory region. The vm_mm field points to the mm_struct memory descriptor of the process that owns the region. We shall describe the other fields of vm_area_struct as they come up.

Memory regions owned by a process never overlap, and the kernel tries to merge regions when a new one is allocated right next to an existing one. Two adjacent regions can be merged if their access rights match.

As shown in Figure 8-1, when a new range of linear addresses is added to the process address space, the kernel checks whether an already existing memory region can be enlarged (case a). If not, a new memory region is created (case b). Similarly, if a range of linear addresses is removed from the process address space, the kernel resizes the affected memory regions (case c). In some cases, the resizing forces a memory region to split into two smaller ones (case d). [2]

[2] Removing a linear address interval may theoretically fail because no free memory is available for a new memory region descriptor.

Figure 8-1. Adding or removing a linear address interval
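The half-open interval convention and the test that separates case (a) from case (b) can be sketched as follows. The structure struct region_sketch and the functions region_len( ) and try_extend( ) are hypothetical; the sketch assumes that "access rights match" can be reduced to comparing a single flags word, which simplifies what the kernel actually checks.

```c
#include <assert.h>

/* Hypothetical, reduced memory region descriptor. */
struct region_sketch {
    unsigned long vm_start;  /* first linear address inside the region */
    unsigned long vm_end;    /* first linear address after the region */
    unsigned long vm_flags;  /* access rights, reduced to one word */
};

/* Length of the region in bytes: vm_end is exclusive. */
unsigned long region_len(const struct region_sketch *r)
{
    return r->vm_end - r->vm_start;
}

/* Case (a): if the new interval [start, end) begins exactly where the
 * region ends and the access rights match, enlarge the region and
 * return 1. Otherwise return 0: the caller must create a new region
 * (case b). */
int try_extend(struct region_sketch *r, unsigned long start,
               unsigned long end, unsigned long flags)
{
    assert(start < end);
    if (r->vm_end == start && r->vm_flags == flags) {
        r->vm_end = end;
        return 1;
    }
    return 0;
}
```

Because vm_end is the first address after the region, two regions are adjacent exactly when the vm_end of one equals the vm_start of the other; no "plus one" adjustment is needed.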

The vm_ops field points to a vm_operations_struct data structure, which stores the methods of the memory region. Only three methods are defined:

open
    Invoked when the memory region is added to the set of regions owned by a process.

close
    Invoked when the memory region is removed from the set of regions owned by a process.

nopage
    Invoked by the Page Fault exception handler when a process tries to access a page not present in RAM whose linear address belongs to the memory region (see Section 8.4 later in this chapter).

8.3.1 Memory Region Data Structures

All the regions owned by a process are linked in a simple list. Regions appear in the list in ascending order by memory address; however, successive regions can be separated by an area of unused memory addresses. The vm_next field of each vm_area_struct element points to the next element in the list. The kernel finds the memory regions through the mmap field of the process memory descriptor, which points to the first memory region descriptor in the list.

The map_count field of the memory descriptor contains the number of regions owned by the process. A process may own up to MAX_MAP_COUNT different memory regions (this value is usually set to 65,536).

Figure 8-2 illustrates the relationships among the address space of a process, its memory descriptor, and the list of memory regions.

Figure 8-2. Descriptors related to the address space of a process

A frequent operation performed by the kernel is to search for the memory region that includes a specific linear address. Since the list is sorted, the search can terminate as soon as a memory region that ends after the specific linear address is found.

However, using the list is convenient only if the process has very few memory regions (let's say fewer than a few tens of them). Searching, inserting elements, and deleting elements in the list involve a number of operations whose times are linearly proportional to the list length.
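The list search just described can be written in a few lines. This is a sketch over a hypothetical struct vma_sketch that keeps only the fields involved; list_find( ) is our name, not a kernel function.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced region descriptor: address interval plus link. */
struct vma_sketch {
    unsigned long vm_start;      /* first address inside the region */
    unsigned long vm_end;        /* first address after the region */
    struct vma_sketch *vm_next;  /* next region, in ascending order */
};

/* Returns the first region that ends after addr, or NULL if there is
 * none. Since the list is sorted, the scan stops at the first match.
 * The returned region contains addr only if addr >= vm_start. */
struct vma_sketch *list_find(struct vma_sketch *head, unsigned long addr)
{
    struct vma_sketch *v;

    for (v = head; v != NULL; v = v->vm_next) {
        assert(v->vm_start < v->vm_end);
        if (v->vm_end > addr)
            return v;
    }
    return NULL;
}
```

The loop visits every region that lies entirely below addr, which is exactly the linear cost mentioned above: a process with thousands of regions pays for thousands of comparisons in the worst case.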

Although most Linux processes use very few memory regions, there are some large applications, such as object-oriented databases, that one might consider "pathological" because they have many hundreds or even thousands of regions. In such cases, the memory region list management becomes very inefficient, hence the performance of the memory-related system calls degrades to an intolerable point.

Therefore, Linux 2.4 stores memory region descriptors in data structures called red-black trees. [3] In a red-black tree, each element (or node) usually has two children: a left child and a right child. The elements in the tree are sorted. For each node N, all elements of the subtree rooted at the left child of N precede N, while, conversely, all elements of the subtree rooted at the right child of N follow N (see Figure 8-3(a); the key of the node is written inside the node itself).

[3] Up to Version 2.4.9, the Linux kernel used another type of balanced search tree called AVL tree.

Moreover, a red-black tree must satisfy four additional rules:

1. Every node must be either red or black.

2. The root of the tree must be black.

3. The children of a red node must be black.

4. Every path from a node to a descendant leaf must contain the same number of black nodes. When counting the number of black nodes, null pointers are counted as black nodes.

Figure 8-3. Example of red-black trees

These four rules ensure that any red-black tree with n internal nodes has a height of at most 2 log(n + 1), where the logarithm is taken in base 2. Searching an element in a red-black tree is thus very efficient because it requires operations whose execution time is linearly proportional to the logarithm of the tree size. In other words, doubling the number of memory regions adds at most a couple of iterations to the operation.

Inserting and deleting an element in a red-black tree is also efficient because the algorithm can quickly traverse the tree to locate the position at which the element will be inserted or from which it will be removed. Any new node must be inserted as a leaf and colored red. If the operation breaks the rules, a few nodes of the tree must be moved or recolored.

For instance, suppose that an element having the value 4 must be inserted in the red-black tree shown in Figure 8-3(a). Its proper position is the right child of the node that has key 3, but once it is inserted, the red node that has the value 3 has a red child, thus breaking rule 3. To satisfy the rule, the color of nodes that have the values 3, 4, and 7 is changed. This operation, however, breaks rule 4, thus the algorithm performs a "rotation" on the subtree rooted at the node that has the key 19, producing the new red-black tree shown in Figure 8-3(b). This looks complicated, but inserting or deleting an element in a red-black tree requires a small number of operations: a number linearly proportional to the logarithm of the tree size.

Therefore, to store the memory regions of a process, Linux uses both a linked list and a red-black tree. Both data structures contain pointers to the same memory region descriptors. When inserting or removing a memory region descriptor, the kernel searches the previous and next elements through the red-black tree and uses them to quickly update the list without scanning it.

The head of the linked list is referenced by the mmap field of the memory descriptor. Any memory region object stores the pointer to the next element of the list in the vm_next field. The root of the red-black tree is referenced by the mm_rb field of the memory descriptor. Any memory region object stores the color of the node, as well as the pointers to the parent, the left child, and the right child, in the vm_rb field of type rb_node_t.

In general, the red-black tree is used to locate a region including a specific address, while the linked list is mostly useful when scanning the whole set of regions.
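To make the four red-black rules concrete, here is a small checker written as a sketch. The node layout struct rb_sketch and the functions black_height( ) and rb_is_valid( ) are hypothetical and unrelated to the kernel's rb_node_t; the checker verifies only the coloring rules and assumes the keys are already in sorted order.

```c
#include <assert.h>
#include <stddef.h>

enum rb_color { RB_RED, RB_BLACK };

/* Hypothetical node layout, for illustration only. */
struct rb_sketch {
    int key;
    enum rb_color color;
    struct rb_sketch *left, *right;
};

/* Returns the number of black nodes on every path from n down to a
 * null pointer (null pointers count as black), or -1 if rule 3 or
 * rule 4 is violated somewhere in the subtree rooted at n. */
int black_height(const struct rb_sketch *n)
{
    int lh, rh;

    if (n == NULL)
        return 1;
    /* Rule 1: every node is either red or black. */
    assert(n->color == RB_RED || n->color == RB_BLACK);
    /* Rule 3: the children of a red node must be black. */
    if (n->color == RB_RED &&
        ((n->left != NULL && n->left->color == RB_RED) ||
         (n->right != NULL && n->right->color == RB_RED)))
        return -1;
    lh = black_height(n->left);
    rh = black_height(n->right);
    /* Rule 4: both sides must carry the same number of black nodes. */
    if (lh < 0 || rh < 0 || lh != rh)
        return -1;
    return lh + (n->color == RB_BLACK ? 1 : 0);
}

/* Returns 1 if the tree satisfies rules 2, 3, and 4; 0 otherwise. */
int rb_is_valid(const struct rb_sketch *root)
{
    if (root == NULL)
        return 1;
    if (root->color != RB_BLACK)    /* Rule 2: the root must be black. */
        return 0;
    return black_height(root) > 0;
}
```

A checker of this kind is useful mostly as a debugging aid: after an insertion that attaches a red leaf under a red parent, rb_is_valid( ) reports the violation of rule 3 until the recoloring (and, if needed, the rotation) has been carried out.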

8.3.2 Memory Region Access Rights

Before moving on, we should clarify the relation between a page and a memory region. As mentioned in Chapter 2, we use the term "page" to refer both to a set of linear addresses and to the data contained in this group of addresses. In particular, we denote the linear address interval ranging between 0 and 4,095 as page 0, the linear address interval ranging between 4,096 and 8,191 as page 1, and so forth. Each memory region therefore consists of a set of pages that have consecutive page numbers.

We have already discussed two kinds of flags associated with a page:

● A few flags such as Read/Write, Present, or User/Supervisor stored in each Page Table entry (see Section 2.4.1).

● A set of flags stored in the flags field of each page descriptor (see Section 7.1).

The first kind of flag is used by the 80x86 hardware to check whether the requested kind of addressing can be performed; the second kind is used by Linux for many different purposes (see Table 7-2).

We now introduce a third kind of flags: those associated with the pages of a memory region. They are stored in the vm_flags field of the vm_area_struct descriptor (see Table 8-4). Some flags offer the kernel information about all the pages of the memory region, such as what they contain and what rights the process has to access each page. Other flags describe the region itself, such as how it can grow.

Table 8-4. The memory region flags

Flag name       Description
VM_READ         Pages can be read.
VM_WRITE        Pages can be written.
VM_EXEC         Pages can be executed.
VM_SHARED       Pages can be shared by several processes.
VM_MAYREAD      VM_READ flag may be set.
VM_MAYWRITE     VM_WRITE flag may be set.
VM_MAYEXEC      VM_EXEC flag may be set.
VM_MAYSHARE     VM_SHARED flag may be set.
VM_GROWSDOWN    The region can expand toward lower addresses.
VM_GROWSUP      The region can expand toward higher addresses.
VM_SHM          The region is used for IPC's shared memory.
VM_DENYWRITE    The region maps a file that cannot be opened for writing.
VM_EXECUTABLE   The region maps an executable file.
VM_LOCKED       Pages in the region are locked and cannot be swapped out.
VM_IO           The region maps the I/O address space of a device.
VM_SEQ_READ     The application accesses the pages sequentially.
VM_RAND_READ    The application accesses the pages in a truly random order.
VM_DONTCOPY     Does not copy the region when forking a new process.
VM_DONTEXPAND   Forbids region expansion through the mremap( ) system call.
VM_RESERVED     Does not swap out the region.

Page access rights included in a memory region descriptor may be combined arbitrarily. It is possible, for instance, to allow the pages of a region to be executed but not read. To implement this protection scheme efficiently, the read, write, and execute access rights associated with the pages of a memory region must be duplicated in all the corresponding Page Table entries so that checks can be directly performed by the Paging Unit circuitry. In other words, the page access rights dictate what kinds of access should generate a Page Fault exception. As we shall see shortly, the job of figuring out what caused the Page Fault is delegated by Linux to the Page Fault handler, which implements several page-handling strategies.

The initial values of the Page Table flags (which must be the same for all pages in the memory region, as we have seen) are stored in the vm_page_prot field of the vm_area_struct descriptor. When adding a page, the kernel sets the flags in the corresponding Page Table entry according to the value of the vm_page_prot field.

However, translating the memory region's access rights into the page protection bits is not straightforward for the following reasons:

● In some cases, a page access should generate a Page Fault exception even when its access type is granted by the page access rights specified in the vm_flags field of the corresponding memory region. For instance, as we shall see in Section 8.4.4 later in this chapter, the kernel may wish to store two identical, writable private pages (whose VM_SHARED flags are cleared) belonging to two different processes into the same page frame; in this case, an exception should be generated when either one of the processes tries to modify the page.

● The Page Tables of 80x86 processors have just two protection bits, namely the Read/Write and User/Supervisor flags. Moreover, the User/Supervisor flag of any page included in a memory region must always be set, since the page must always be accessible by User Mode processes.

To overcome the hardware limitation of the 80x86 microprocessors, Linux adopts the following rules:

● The read access right always implies the execute access right.

● The write access right always implies the read access right.

Moreover, to correctly defer the allocation of page frames through the technique described in Section 8.4.4 later in this chapter, the page frame is write-protected whenever the corresponding page must not be shared by several processes. Therefore, the 16 possible combinations of the read, write, execute, and share access rights are scaled down to the following three:

● If the page has both write and share access rights, the Read/Write bit is set.

● If the page has the read or execute access right but lacks the write access right, the share access right, or both, the Read/Write bit is cleared.

● If the page does not have any access rights, the Present bit is cleared so that each access generates a Page Fault exception. However, to distinguish this condition from the real page-not-present case, Linux also sets the Page size bit to 1. [4]

[4] You might consider this use of the Page size bit to be a dirty trick, since the bit was meant to indicate the real page size. But Linux can get away with the deception because the 80x86 chip checks the Page size bit in Page Directory entries, but not in Page Table entries.

The downscaled protection bits corresponding to each combination of access rights are stored in the protection_map array.
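The three-way reduction can be expressed as a function of the four access-right flags. The following is a sketch, not the contents of protection_map: the function prot_bits( ) and the PTE_ macro names are hypothetical, the numeric values are chosen for illustration, and we assume that a region with no read, write, or execute right is treated as "no access" regardless of its share flag.

```c
#include <assert.h>

/* Access rights of a region (illustrative values). */
#define VM_READ    0x1UL
#define VM_WRITE   0x2UL
#define VM_EXEC    0x4UL
#define VM_SHARED  0x8UL

/* Page Table entry bits (hypothetical names, illustrative values). */
#define PTE_PRESENT   0x001UL
#define PTE_RW        0x002UL  /* Read/Write */
#define PTE_USER      0x004UL  /* User/Supervisor */
#define PTE_PAGESIZE  0x080UL  /* Page size bit, reused as a marker */

/* Scales the 16 combinations of rights down to three outcomes. */
unsigned long prot_bits(unsigned long vm_flags)
{
    unsigned long rights = vm_flags & (VM_READ | VM_WRITE | VM_EXEC);

    /* Only the four access-right flags are meaningful here. */
    assert((vm_flags & ~(VM_READ | VM_WRITE | VM_EXEC | VM_SHARED)) == 0);

    /* No access rights: Present cleared, Page size set as a marker. */
    if (rights == 0)
        return PTE_PAGESIZE;

    /* Write and share: the page may be written directly. */
    if ((vm_flags & VM_WRITE) && (vm_flags & VM_SHARED))
        return PTE_PRESENT | PTE_USER | PTE_RW;

    /* Everything else: present but write-protected, so that a write
     * to a private page raises a Page Fault the kernel can handle. */
    return PTE_PRESENT | PTE_USER;
}
```

Since the result depends only on four bits, the same mapping can be precomputed once and stored in a 16-entry table indexed by the low four bits of the flags, which is the role the text assigns to the protection_map array.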

8.3.3 Memory Region Handling

Having the basic understanding of data structures and state information that control memory handling, we can look at a group of low-level functions that operate on memory region descriptors. They should be considered auxiliary functions that simplify the implementation of do_mmap( ) and do_munmap( ). Those two functions, which are described in Section 8.3.4 and Section 8.3.5 later in this chapter, enlarge and shrink the address space of a process, respectively. Working at a higher level than the functions we consider here, they do not receive a memory region descriptor as their parameter, but rather the initial address, the length, and the access rights of a linear address interval.

8.3.3.1 Finding the closest region to a given address: find_vma( )

The find_vma( ) function acts on two parameters: the address mm of a process memory descriptor and a linear address addr. It locates the first memory region whose vm_end field is greater than addr and returns the address of its descriptor; if no such region exists, it returns a NULL pointer. Notice that the region selected by find_vma( ) does not necessarily include addr because addr may lie outside of any memory region.

Each memory descriptor includes a mmap_cache field that stores the descriptor address of the region that was last referenced by the process. This additional field is introduced to reduce the time spent in looking for the region that contains a given linear address. Locality of address references in programs makes it highly likely that if the last linear address checked belonged to a given region, the next one to be checked belongs to the same region.

The function thus starts by checking whether the region identified by mmap_cache includes addr. If so, it returns the region descriptor pointer:

    vma = mm->mmap_cache;
    if (vma && vma->vm_end > addr && vma->vm_start <= addr)
        return vma;

Otherwise, the function looks up the memory region in the red-black tree of the memory descriptor:

    rb_node = mm->mm_rb.rb_node;
    vma = NULL;
    while (rb_node) {
        vma_tmp = rb_entry(rb_node, struct vm_area_struct, vm_rb);
        if (vma_tmp->vm_end > addr) {
            vma = vma_tmp;
            if (vma_tmp->vm_start <= addr)
                break;
            rb_node = rb_node->rb_left;
        } else
            rb_node = rb_node->rb_right;
    }
    if (vma)
        mm->mmap_cache = vma;
    return vma;

The function uses the rb_entry macro, which derives from a pointer to a node of the red-black tree the address of the corresponding memory region descriptor.

The kernel also defines the find_vma_prev( ) function (which returns the descriptor addresses of the memory region that precedes the linear address given as parameter and of the memory region that follows it) and the find_vma_prepare( ) function (which locates the position of the new leaf in the red-black tree that corresponds to a given linear address and returns the addresses of the preceding memory region and of the parent node of the leaf to be inserted).
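To see why the tree walk of find_vma( ) returns the lowest region ending above addr, it may help to run the same loop on a plain binary search tree in User Mode. The following is only a sketch: struct region and find_region( ) are hypothetical stand-ins for vm_area_struct and find_vma( ), the tree is not balanced, and the mmap_cache shortcut is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for vm_area_struct: one interval [start, end). */
struct region {
    unsigned long start, end;
    struct region *left, *right;   /* children, ordered by address */
};

static struct region *find_region(struct region *node, unsigned long addr)
{
    struct region *best = NULL;

    while (node) {
        assert(node->start < node->end);   /* regions are never empty */
        if (node->end > addr) {
            best = node;                   /* candidate; look for a lower one */
            if (node->start <= addr)
                break;                     /* addr lies inside this region */
            node = node->left;
        } else {
            node = node->right;            /* region ends at or below addr */
        }
    }
    return best;
}
```

Every time the walk moves left it has already recorded a candidate, so running off the bottom of the tree still yields the closest region above addr.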

8.3.3.2 Finding a region that overlaps a given interval: find_vma_intersection( )

The find_vma_intersection( ) function finds the first memory region that overlaps a given linear address interval; the mm parameter points to the memory descriptor of the process, while the start_addr and end_addr linear addresses specify the interval:

    vma = find_vma(mm,start_addr);
    if (vma && end_addr <= vma->vm_start)
        vma = NULL;
    return vma;

The function returns a NULL pointer if no such region exists. To be exact, if find_vma( ) returns a valid address but the memory region found starts after the end of the linear address interval, vma is set to NULL.
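The same two-step test can be tried out in User Mode on a sorted array of intervals. This is a sketch under stated assumptions: struct region, first_region_above( ), and find_intersection( ) are hypothetical names, and a linear scan stands in for the red-black tree lookup.

```c
#include <assert.h>
#include <stddef.h>

struct region { unsigned long start, end; };   /* [start, end) */

/* First region whose end lies above addr; r[] is sorted and disjoint. */
static const struct region *first_region_above(const struct region *r,
                                               size_t n, unsigned long addr)
{
    for (size_t i = 0; i < n; i++)
        if (r[i].end > addr)
            return &r[i];
    return NULL;
}

static const struct region *find_intersection(const struct region *r, size_t n,
                                              unsigned long start_addr,
                                              unsigned long end_addr)
{
    const struct region *hit;

    assert(start_addr < end_addr);
    hit = first_region_above(r, n, start_addr);
    if (hit && end_addr <= hit->start)
        hit = NULL;                /* region starts after the interval ends */
    return hit;
}
```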

8.3.3.3 Finding a free interval: arch_get_unmapped_area( )

The arch_get_unmapped_area( ) function searches the process address space to find an available linear address interval. The len parameter specifies the interval length, while the addr parameter may specify the address from which the search is started. If the search is successful, the function returns the initial address of the new interval; otherwise, it returns the error code -ENOMEM.

    if (len > TASK_SIZE)
        return -ENOMEM;
    addr = (addr + 0xfff) & 0xfffff000;
    if (addr && addr + len <= TASK_SIZE) {
        vma = find_vma(current->mm, addr);
        if (!vma || addr + len <= vma->vm_start)
            return addr;
    }
    addr = (TASK_SIZE/3 + 0xfff) & 0xfffff000;
    for (vma = find_vma(current->mm, addr); ; vma = vma->vm_next) {
        if (addr + len > TASK_SIZE)
            return -ENOMEM;
        if (!vma || addr + len <= vma->vm_start)
            return addr;
        addr = vma->vm_end;
    }

The function starts by checking to make sure the interval length is within the limit imposed on User Mode linear addresses, usually 3 GB. If addr is different from zero, the function tries to allocate the interval starting from addr. To be on the safe side, the function rounds up the value of addr to a multiple of 4 KB. If addr is 0 or the previous search failed, the search's starting point is set to one third of the User Mode linear address space. Starting from addr, the function invokes find_vma( ) and then follows the vm_next links, with increasing values of addr, to find the required free interval. During this search, the following cases may occur:

● The requested interval is larger than the portion of linear address space yet to be scanned (addr + len > TASK_SIZE). Since there are not enough linear addresses to satisfy the request, return -ENOMEM.
● The hole following the last scanned region is not large enough (vma != NULL && vma->vm_start < addr + len). Consider the next region.
● If neither one of the preceding conditions holds, a large enough hole has been found. Return addr.
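The three cases amount to a first-fit scan over the holes between regions. The sketch below reproduces only that scan on a sorted array. The names find_hole( ), SKETCH_PAGE_SIZE, and SKETCH_LIMIT are hypothetical, the 3 GB limit is assumed, and the fallback to one third of the address space is omitted. It returns 0 on success and -1 where the kernel would return -ENOMEM.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 0x1000UL        /* 4 KB pages */
#define SKETCH_LIMIT     0xC0000000UL    /* assumed 3 GB User Mode limit */

struct region { unsigned long start, end; };   /* [start, end), sorted */

static int find_hole(const struct region *r, size_t n, unsigned long hint,
                     unsigned long len, unsigned long *out)
{
    /* Round the starting point up to a page boundary. */
    unsigned long addr = (hint + SKETCH_PAGE_SIZE - 1) & ~(SKETCH_PAGE_SIZE - 1);
    size_t i = 0;

    assert(len > 0);
    if (len > SKETCH_LIMIT)
        return -1;
    while (i < n && r[i].end <= addr)    /* the role played by find_vma() */
        i++;
    for (;; i++) {
        if (addr > SKETCH_LIMIT - len)   /* addr + len exceeds the limit */
            return -1;
        if (i == n || addr + len <= r[i].start) {
            *out = addr;                 /* the hole is large enough */
            return 0;
        }
        addr = r[i].end;                 /* retry right after this region */
    }
}
```

Because the regions are sorted and disjoint, each retry address is never above the start of the next region, so one pass over the list is enough.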

8.3.3.4 Inserting a region in the memory descriptor list: insert_vm_struct( )

insert_vm_struct( ) inserts a vm_area_struct structure in the memory region object list and red-black tree of a memory descriptor. It uses two parameters: mm, which specifies the address of a process memory descriptor, and vma, which specifies the address of the vm_area_struct object to be inserted. The vm_start and vm_end fields of the memory region object must have already been initialized. The function invokes the find_vma_prepare( ) function to look up the position in the red-black tree mm->mm_rb where vma should go. Then insert_vm_struct( ) invokes the vma_link( ) function, which in turn:

1. Acquires the mm->page_table_lock spin lock.
2. Inserts the memory region in the linked list referenced by mm->mmap.
3. Inserts the memory region in the red-black tree mm->mm_rb.
4. Releases the mm->page_table_lock spin lock.
5. Increments by 1 the mm->map_count counter.

If the region contains a memory-mapped file, the vma_link( ) function performs additional tasks that are described in Chapter 16.

The kernel also defines the __insert_vm_struct( ) function, which is identical to insert_vm_struct( ) but doesn't acquire any lock before modifying the memory region data structures referenced by mm. The kernel uses it when it is sure that no concurrent accesses to the memory region data structures can happen (for instance, because it already acquired a suitable lock).

The __vma_unlink( ) function receives as parameter a memory descriptor address mm and two memory region object addresses vma and prev. Both memory regions should belong to mm, and prev should precede vma in the memory region ordering. The function removes vma from the linked list and the red-black tree of the memory descriptor.
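The list half of these operations is easy to reproduce in User Mode. The sketch below keeps a singly linked list sorted by start address and shows why the unlink operation wants the preceding region as a parameter. The names struct addr_space, insert_region( ), and unlink_region( ) are hypothetical. Locking and the red-black tree are left out, and the sketch decrements its counter inside unlink_region( ), which is a simplification of this example rather than something __vma_unlink( ) is said to do.

```c
#include <assert.h>
#include <stddef.h>

struct region {
    unsigned long start, end;
    struct region *next;
};

struct addr_space {
    struct region *head;   /* sorted by start address */
    int count;             /* plays the role of map_count */
};

static void insert_region(struct addr_space *as, struct region *r)
{
    struct region **link = &as->head;

    assert(r->start < r->end);         /* fields must be initialized first */
    while (*link && (*link)->start < r->start)
        link = &(*link)->next;         /* find the insertion point */
    r->next = *link;
    *link = r;
    as->count++;
}

/* prev must be the region right before r, or NULL if r is the first one. */
static void unlink_region(struct addr_space *as, struct region *r,
                          struct region *prev)
{
    if (prev)
        prev->next = r->next;
    else
        as->head = r->next;
    r->next = NULL;
    as->count--;                       /* sketch only; see the note above */
}
```

With a singly linked list, passing prev makes the removal a constant-time operation instead of a scan from the head.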

8.3.4 Allocating a Linear Address Interval

Now let's discuss how new linear address intervals are allocated. To do this, the do_mmap( ) function creates and initializes a new memory region for the current process. However, after a successful allocation, the memory region could be merged with other memory regions defined for the process.

The function uses the following parameters:

file and offset
    File descriptor pointer file and file offset offset are used if the new memory region will map a file into memory. This topic is discussed in Chapter 15. In this section, we assume that no memory mapping is required and that file and offset are both NULL.

addr
    This linear address specifies where the search for a free interval must start.

len
    The length of the linear address interval.

prot
    This parameter specifies the access rights of the pages included in the memory region. Possible flags are PROT_READ, PROT_WRITE, PROT_EXEC, and PROT_NONE. The first three flags mean the same things as the VM_READ, VM_WRITE, and VM_EXEC flags. PROT_NONE indicates that the process has none of those access rights.

flag
    This parameter specifies the remaining memory region flags:

    MAP_GROWSDOWN, MAP_LOCKED, MAP_DENYWRITE, and MAP_EXECUTABLE
        Their meanings are identical to those of the flags listed in Table 8-4.

    MAP_SHARED and MAP_PRIVATE
        The former flag specifies that the pages in the memory region can be shared among several processes; the latter flag has the opposite effect. Both flags refer to the VM_SHARED flag in the vm_area_struct descriptor.

    MAP_ANONYMOUS
        No file is associated with the memory region (see Chapter 15).

    MAP_FIXED
        The initial linear address of the interval must be exactly the one specified in the addr parameter.

    MAP_NORESERVE
        The function doesn't have to do a preliminary check on the number of free page frames.
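From User Mode, the same prot and flag values reach the kernel through the mmap( ) system call. The following sketch shows the case assumed in this section: an anonymous, private region with no file behind it. The map_anonymous( ) wrapper is a hypothetical helper, and the example assumes a Linux system where MAP_ANONYMOUS is available.

```c
#define _DEFAULT_SOURCE        /* exposes MAP_ANONYMOUS in strict -std= modes */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Asks the kernel for len bytes of private, zero-filled, read/write memory.
 * Returns NULL on failure. */
static unsigned char *map_anonymous(size_t len)
{
    void *p;

    assert(len > 0);
    p = mmap(NULL,                         /* addr: let the kernel choose */
             len,                          /* len */
             PROT_READ | PROT_WRITE,       /* prot */
             MAP_PRIVATE | MAP_ANONYMOUS,  /* flags: not shared, no file */
             -1, 0);                       /* no file descriptor, offset 0 */
    return p == MAP_FAILED ? NULL : (unsigned char *)p;
}
```

A caller releases the interval with munmap( ), passing the returned address and the same length.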

The do_mmap( ) function performs some preliminary checks on the value of offset and then executes the do_mmap_pgoff( ) function. Assuming that the new interval of linear addresses does not map a file on disk, the latter function executes the following steps:

1. Checks whether the parameter values are correct and whether the request can be satisfied. In particular, it checks for the following conditions that prevent it from satisfying the request:

   ● The linear address interval has zero length or includes addresses greater than TASK_SIZE.
   ● The process has already mapped too many memory regions, so the value of the map_count field of its mm memory descriptor exceeds MAX_MAP_COUNT.
   ● The flag parameter specifies that the pages of the new linear address interval must be locked in RAM, and the number of pages locked by the process exceeds the threshold stored in the rlim[RLIMIT_MEMLOCK].rlim_cur field of the process descriptor.

   If any of the preceding conditions holds, do_mmap_pgoff( ) terminates by returning a negative value. If the linear address interval has a zero length, the function returns without performing any action.

2. Obtains a linear address interval for the new region; if the MAP_FIXED flag is set, a check is made on the addr value. Otherwise, the arch_get_unmapped_area( ) function is invoked to get it:

       if (flags & MAP_FIXED) {
           if (addr + len > TASK_SIZE)
               return -ENOMEM;
           if (addr & ~PAGE_MASK)
               return -EINVAL;
       } else
           addr = arch_get_unmapped_area(file, addr, len, pgoff, flags);

3. Computes the flags of the new memory region by combining the values stored in the prot and flags parameters:

       vm_flags = calc_vm_flags(prot,flags) | mm->def_flags
                  | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
       if (flags & MAP_SHARED)
           vm_flags |= VM_SHARED | VM_MAYSHARE;

   The calc_vm_flags( ) function sets the VM_READ, VM_WRITE, and VM_EXEC flags in vm_flags only if the corresponding PROT_READ, PROT_WRITE, and PROT_EXEC flags in prot are set; it also sets the VM_GROWSDOWN, VM_DENYWRITE, and VM_EXECUTABLE flags in vm_flags only if the corresponding MAP_GROWSDOWN, MAP_DENYWRITE, and MAP_EXECUTABLE flags in flags are set. A few other flags are set to 1 in vm_flags: VM_MAYREAD, VM_MAYWRITE, VM_MAYEXEC, the default flags for all memory regions in mm->def_flags, [5] and both VM_SHARED and VM_MAYSHARE if the memory region has to be shared with other processes.

   [5] Actually, the def_flags field of the memory descriptor is modified only by the mlockall( ) system call, which can be used to set the VM_LOCKED flag, thus locking all future pages of the calling process in RAM.

4. Invokes find_vma_prepare( ) to locate the object of the memory region that shall precede the new interval, as well as the position of the new region in the red-black tree:

       for (;;) {
           vma = find_vma_prepare(mm, addr, &prev, &rb_link, &rb_parent);
           if (!vma || vma->vm_start >= addr + len)
               break;
           if (do_munmap(mm, addr, len))
               return -ENOMEM;
       }

   The find_vma_prepare( ) function also checks whether a memory region that overlaps the new interval already exists. This occurs when the function returns a non-NULL address pointing to a region that starts before the end of the new interval. In this case, do_mmap_pgoff( ) invokes do_munmap( ) to remove the new interval and then repeats the whole step (see Section 8.3.5 later in this chapter).

5. Checks whether inserting the new memory region causes the size of the process address space (mm->total_vm [...]

   [...]

       vma->vm_start = addr;
       vma->vm_end = addr + len;
       vma->vm_flags = vm_flags;
       vma->vm_page_prot = protection_map[vm_flags & 0x0f];
       vma->vm_ops = NULL;
       vma->vm_pgoff = pgoff;
       vma->vm_file = NULL;
       vma->vm_private_data = NULL;
       vma->vm_raend = 0;

10. If the MAP_SHARED flag is set (and the new memory region doesn't map a file on disk), the region is used for IPC shared memory. The function invokes shmem_zero_setup( ) to initialize it (see Chapter 19).

11. Invokes vma_link( ) to insert the new region in the memory region list and red-black tree (see the earlier Section 8.3.3.4).

12. Increments the size of the process address space stored in the total_vm field of the memory descriptor.

13. If the VM_LOCKED flag is set, increments the counter of locked pages mm->locked_vm:

        if (vm_flags & VM_LOCKED) {
            mm->locked_vm += len >> PAGE_SHIFT;
            make_pages_present(addr, addr + len);
        }

    The make_pages_present( ) function is invoked to allocate all the pages of the memory region in succession and lock them in RAM. The function, in turn, invokes get_user_pages( ) as follows:

        write = (vma->vm_flags & VM_WRITE) != 0;
        get_user_pages(current, current->mm, addr, len, write, 0, NULL, NULL);

    The get_user_pages( ) function cycles through all starting linear addresses of the pages between addr and addr+len; for each of them, it invokes follow_page( ) to check whether there is a mapping to a physical page in the current's Page Tables. If no such physical page exists, get_user_pages( ) invokes handle_mm_fault( ), which, as we shall see in Section 8.4.2, allocates one page frame and sets its Page Table entry according to the vm_flags field of the memory region descriptor.

14. Finally, terminates by returning the linear address of the new memory region.

8.3.5 Releasing a Linear Address Interval

The do_munmap( ) function deletes a linear address interval from the address space of the current process. The parameters are the starting address addr of the interval and its length len. The interval to be deleted does not usually correspond to a memory region; it may be included in one memory region or span two or more regions.

The function goes through two main phases. First, it scans the list of memory regions owned by the process and removes all regions that overlap the linear address interval. In the second phase, the function updates the process Page Tables and reinserts a downsized version of the memory regions that were removed during the first phase.

8.3.5.1 First phase: scanning the memory regions

The do_munmap( ) function executes the following steps:

1. Performs some preliminary checks on the parameter values. If the linear address interval includes addresses greater than TASK_SIZE, if addr is not a multiple of 4,096, or if the linear address interval has a zero length, it returns the error code -EINVAL.

2. Locates the first memory region that overlaps the linear address interval to be deleted:

       mpnt = find_vma_prev(current->mm, addr, &prev);
       if (!mpnt || mpnt->vm_start >= addr + len)
           return 0;

3. If the linear address interval is located inside a memory region, its deletion splits the region into two smaller ones. In this case, do_munmap( ) checks whether current is allowed to obtain an additional memory region:

       if ((mpnt->vm_start < addr && mpnt->vm_end > addr + len) &&
           current->mm->map_count > MAX_MAP_COUNT)
           return -ENOMEM;

4. Attempts to get a new vm_area_struct descriptor. There may be no need for it, but the function makes the request anyway so that it can terminate right away if the allocation fails. This cautious approach simplifies the code since it allows an easy error exit.

5. Builds up a list that includes all descriptors of the memory regions that overlap the linear address interval. This list is created by setting the vm_next field of the memory region descriptor (temporarily) so it points to the previous item in the list; this field thus acts as a backward link. As each region is added to this backward list, a local variable named free points to the last inserted element. The regions inserted in the list are also removed from the list of memory regions owned by the process and from the red-black tree (by means of the rb_erase( ) function):

       npp = (prev ? &prev->vm_next : &current->mm->mmap);
       free = NULL;
       spin_lock(&current->mm->page_table_lock);
       for ( ; mpnt && mpnt->vm_start < addr + len; mpnt = *npp) {
           *npp = mpnt->vm_next;
           mpnt->vm_next = free;
           free = mpnt;
           rb_erase(&mpnt->vm_rb, &current->mm->mm_rb);
       }
       current->mm->mmap_cache = NULL;
       spin_unlock(&current->mm->page_table_lock);
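The pointer-to-pointer idiom used in step 5 can be tried out on an ordinary linked list. The sketch below detaches every region overlapping an interval and returns the detached regions linked in reverse order, as the free list is. The names struct region and detach_overlapping( ) are hypothetical. The red-black tree and the spin lock are omitted, and the sketch finds the first candidate by scanning from the head rather than by calling find_vma_prev( ).

```c
#include <assert.h>
#include <stddef.h>

struct region {
    unsigned long start, end;      /* [start, end) */
    struct region *next;
};

static struct region *detach_overlapping(struct region **head,
                                         unsigned long addr, unsigned long len)
{
    struct region **npp = head;    /* link that points to the next candidate */
    struct region *free_list = NULL, *r;

    assert(len > 0);
    while (*npp && (*npp)->end <= addr)
        npp = &(*npp)->next;       /* skip regions entirely below addr */
    for (r = *npp; r && r->start < addr + len; r = *npp) {
        *npp = r->next;            /* unlink r from the sorted list */
        r->next = free_list;       /* reuse next as a backward link */
        free_list = r;
    }
    return free_list;
}
```

Since npp always addresses the link that must be rewritten, the loop needs no special case for removing the first element of the list.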

8.3.5.2 Second phase: updating the Page Tables

A while loop is used to scan the list of memory regions built in the first phase, starting with the memory region descriptor that free points to.

In each iteration, the mpnt local variable points to the descriptor of a memory region in the list. The map_count field of the current->mm memory descriptor is decremented (since the region has been removed in the first phase from the list of regions owned by the process) and a check is made (by means of two question-mark conditional expressions) to determine whether the mpnt region must be eliminated or simply downsized:

    current->mm->map_count--;
    st = addr < mpnt->vm_start ? mpnt->vm_start : addr;
    end = addr+len;
    end = end > mpnt->vm_end ? mpnt->vm_end : end;
    size = end - st;

The st and end local variables delimit the linear address interval in the mpnt memory region that should be deleted; the size local variable specifies the length of the interval.

Next, do_munmap( ) releases the page frames allocated for the pages included in the interval from st to end:

    zap_page_range(mm, st, size);

The zap_page_range( ) function deallocates the page frames included in the interval from st to end and updates the corresponding Page Table entries. The function invokes in nested fashion the zap_pmd_range( ) and zap_pte_range( ) functions for scanning the Page Tables; the latter function clears the Page Table entries and frees the corresponding page frames (or slot in a swap area; see Chapter 14). While doing this, zap_pte_range( ) also invalidates the TLB entries corresponding to the interval from st to end.

The last action performed in each iteration of the do_munmap( ) loop is to check whether a downsized version of the mpnt memory region must be reinserted in the list of regions of current:

    extra = unmap_fixup(mm, mpnt, st, size, extra);

The unmap_fixup( ) function considers four possible cases:

1. The memory region has been entirely removed. It returns the address of the previously allocated memory region object (see Step 4 in the earlier Section 8.3.5.1), which can be released by invoking kmem_cache_free( ).

2. Only the final part of the memory region, up to vm_end, has been removed:

       (mpnt->vm_start < st) && (mpnt->vm_end == end)

   In this case, it updates the vm_end field of mpnt, invokes __insert_vm_struct( ) to insert the downsized region in the list of regions belonging to the process, and returns the address of the previously allocated memory region object.

3. Only the initial part of the memory region, starting at vm_start, has been removed:

       (mpnt->vm_start == st) && (mpnt->vm_end > end)

   In this case, it updates the vm_start field of mpnt, invokes __insert_vm_struct( ) to insert the downsized region in the list of regions belonging to the process, and returns the address of the previously allocated memory region object.

4. The linear address interval is in the middle of the memory region:

       (mpnt->vm_start < st) && (mpnt->vm_end > end)

   It updates the vm_start and vm_end fields of mpnt and of the previously allocated extra memory region object so that they refer to the linear address intervals, respectively, from mpnt->vm_start to st and from end to mpnt->vm_end. Then it invokes __insert_vm_struct( ) twice to insert the two regions in the list of regions belonging to the process and in the red-black tree, and returns NULL, thus preserving the memory region object previously allocated.

This terminates the description of what must be done in a single iteration of the second-phase loop of do_munmap( ).
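The clamping of the interval and the four cases of unmap_fixup( ) can be checked with a few lines of User Mode code. This is a sketch only: the enum names and the classify_unmap( ) helper are hypothetical, and the function merely reports which case applies instead of updating any descriptor.

```c
#include <assert.h>

enum fixup { WHOLE_REGION, TAIL_REMOVED, HEAD_REMOVED, HOLE_IN_MIDDLE };

struct region { unsigned long start, end; };   /* [start, end) */

/* Clamps [addr, addr + len) to the region, stores the clamped bounds in
 * *st and *end, and tells which of the four cases applies. */
static enum fixup classify_unmap(const struct region *r,
                                 unsigned long addr, unsigned long len,
                                 unsigned long *st, unsigned long *end)
{
    assert(addr < r->end && addr + len > r->start);   /* must overlap */
    *st = addr < r->start ? r->start : addr;
    *end = addr + len;
    *end = *end > r->end ? r->end : *end;

    if (*st == r->start && *end == r->end)
        return WHOLE_REGION;       /* nothing left to reinsert */
    if (*end == r->end)
        return TAIL_REMOVED;       /* region shrinks to [start, st) */
    if (*st == r->start)
        return HEAD_REMOVED;       /* region shrinks to [end, old end) */
    return HOLE_IN_MIDDLE;         /* two regions remain: extra one needed */
}
```

Only the last case consumes the spare descriptor, which is why the first phase allocates exactly one of them in advance.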

After handling all the memory region descriptors in the list built during the first phase, do_munmap( ) checks if the additional extra memory descriptor has been used. If the address returned by unmap_fixup( ) is NULL, the descriptor has been used; otherwise, do_munmap( ) invokes kmem_cache_free( ) to release it. Finally, do_munmap( ) invokes the free_pgtables( ) function: it again scans the Page Table entries corresponding to the linear address interval just removed and reclaims the page frames that store unused Page Tables.


8.4 Page Fault Exception Handler

As stated previously, the Linux Page Fault exception handler must distinguish exceptions caused by programming errors from those caused by a reference to a page that legitimately belongs to the process address space but simply hasn't been allocated yet.

The memory region descriptors allow the exception handler to perform its job quite efficiently. The do_page_fault( ) function, which is the Page Fault interrupt service routine for the 80x86 architecture, compares the linear address that caused the Page Fault against the memory regions of the current process; it can thus determine the proper way to handle the exception according to the scheme that is illustrated in Figure 8-4.

Figure 8-4. Overall scheme for the Page Fault handler

In practice, things are a lot more complex because the Page Fault handler must recognize several particular subcases that fit awkwardly into the overall scheme, and it must distinguish several kinds of legal access. A detailed flow diagram of the handler is illustrated in Figure 8-5.

Figure 8-5. The flow diagram of the Page Fault handler

The identifiers vmalloc_fault, good_area, bad_area, and no_context are labels appearing in do_page_fault( ) that should help you to relate the blocks of the flow diagram to specific lines of code.

The do_page_fault( ) function accepts the following input parameters:



● The regs address of a pt_regs structure containing the values of the microprocessor registers when the exception occurred.
● A 3-bit error_code, which is pushed on the stack by the control unit when the exception occurred (see Section 4.2.4). The bits have the following meanings:

   ● If bit 0 is clear, the exception was caused by an access to a page that is not present (the Present flag in the Page Table entry is clear); otherwise, if bit 0 is set, the exception was caused by an invalid access right.
   ● If bit 1 is clear, the exception was caused by a read or execute access; if set, the exception was caused by a write access.
   ● If bit 2 is clear, the exception occurred while the processor was in Kernel Mode; otherwise, it occurred in User Mode.
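The three bits can be decoded with a few mask operations. The following User Mode sketch only illustrates the bit layout just described; struct fault_info and decode_error_code( ) are hypothetical names that do not appear in the kernel.

```c
#include <assert.h>
#include <stdbool.h>

struct fault_info {
    bool protection;   /* bit 0: set = invalid access right, clear = not present */
    bool write;        /* bit 1: set = write access, clear = read or execute */
    bool user_mode;    /* bit 2: set = User Mode, clear = Kernel Mode */
};

static struct fault_info decode_error_code(unsigned long error_code)
{
    struct fault_info f;

    assert(error_code < 8);            /* the 3-bit code described above */
    f.protection = (error_code & 1) != 0;
    f.write      = (error_code & 2) != 0;
    f.user_mode  = (error_code & 4) != 0;
    return f;
}
```

For instance, the value 6 denotes a User Mode write to a page that is not present, the typical outcome of the first write to a freshly allocated region.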

The first operation of do_page_fault( ) consists of reading the linear address that caused the Page Fault. When the exception occurs, the CPU control unit stores that value in the cr2 control register:

    asm("movl %%cr2,%0":"=r" (address));
    if (regs->eflags & 0x00000200)
        local_irq_enable();
    tsk = current;

The linear address is saved in the address local variable. The function also ensures that local interrupts are enabled if they were enabled before the fault and saves the pointer to the process descriptor of current in the tsk local variable.

As shown at the top of Figure 8-5, do_page_fault( ) checks whether the faulty linear address belongs to the fourth gigabyte and the exception was caused by the kernel trying to access a nonexisting page frame (bits 0 and 2 of error_code are both clear):

    if (address >= TASK_SIZE && !(error_code & 5))
        goto vmalloc_fault;

The code at label vmalloc_fault takes care of faults that were likely caused by accessing a noncontiguous memory area in Kernel Mode; we describe this case in Section 8.4.5 later in this chapter.

Next, the handler checks whether the exception occurred while handling an interrupt or executing a kernel thread (remember that the mm field of the process descriptor is always NULL for kernel threads):

    info.si_code = SEGV_MAPERR;
    if (in_interrupt( ) || !tsk->mm)
        goto no_context;

In both cases, do_page_fault( ) does not try to compare the linear address with the memory regions of current, since it would not make any sense: interrupt handlers and kernel threads never use linear addresses below TASK_SIZE, and thus never rely on memory regions. (See the next section for information on the info local variable and a description of the code at the no_context label.)

Let's suppose that the Page Fault did not occur in an interrupt handler or in a kernel thread. Then the function must inspect the memory regions owned by the process to determine whether the faulty linear address is included in the process address space:

    down_read(&tsk->mm->mmap_sem);
    vma = find_vma(tsk->mm, address);
    if (!vma)
        goto bad_area;
    if (vma->vm_start [...]

[...] vm_ops->nopage field is not NULL. In this case, the memory region maps a disk file and the field points to the function that loads the page. This case is covered in Section 15.2.4 and in Section 19.3.5.



Either the vm_ops field or the vma->vm_ops->nopage field is NULL. In this case, the memory region does not map a file on disk; that is, it is an anonymous mapping. Thus, do_no_page( ) invokes the do_anonymous_page( ) function to get a new page frame:

    if (!vma->vm_ops || !vma->vm_ops->nopage)
        return do_anonymous_page(mm, vma, page_table, write_access, address);

The do_anonymous_page( ) function handles write and read requests separately:

    if (write_access) {
        spin_unlock(&mm->page_table_lock);
        page = alloc_page(GFP_HIGHUSER);
        addr = kmap_atomic(page, KM_USER0);
        memset((void *)(addr), 0, PAGE_SIZE);
        kunmap_atomic(addr, KM_USER0);
        spin_lock(&mm->page_table_lock);
        mm->rss++;
        entry = pte_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot)));
        lru_cache_add(page);
        mark_page_accessed(page);
        set_pte(page_table, entry);
        spin_unlock(&mm->page_table_lock);
        return 1;
    }

When handling a write access, the function invokes alloc_page( ) and fills the new page frame with zeros by using the memset macro. The function then increments the min_flt field of tsk to keep track of the number of minor Page Faults caused by the process. Next, the function increments the rss field of the memory descriptor to keep track of the number of page frames allocated to the process. [8] The Page Table entry is then set to the physical address of the page frame, which is marked as writable and dirty. The lru_cache_add( ) and mark_page_accessed( ) functions insert the new page frame in the swap-related data structures; we discuss them in Chapter 16.

[8] Linux records the number of minor and major Page Faults for each process. This information, together with several other statistics, may be used to tune the system.

Conversely, when handling a read access, the content of the page is irrelevant because the process is addressing it for the first time. It is safer to give a page filled with zeros to the process rather than an old page filled with information written by some other process. Linux goes one step further in the spirit of demand paging. There is no need to assign a new page frame filled with zeros to the process right away, since we might as well give it an existing page called zero page, thus deferring further page frame allocation. The zero page is allocated statically during kernel initialization in the empty_zero_page variable (an array of 1,024 long integers filled with zeros); it is stored in the fifth page frame (starting from physical address 0x00004000) and can be referenced by means of the ZERO_PAGE macro.

The Page Table entry is thus set with the physical address of the zero page:

    entry = pte_wrprotect(mk_pte(ZERO_PAGE, vma->vm_page_prot));
    set_pte(page_table, entry);
    spin_unlock(&mm->page_table_lock);
    return 1;

Since the page is marked as nonwritable, if the process attempts to write in it, the Copy On Write mechanism is activated. Only then does the process get a page of its own to write in. The mechanism is described in the next section.
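From User Mode, the visible effect of this handling is that a freshly created anonymous mapping reads as zeros and can then be written. The sketch below checks only that observable behavior with the standard mmap( ) call; it cannot show whether the kernel used the shared zero page or a newly allocated frame. The function name anon_page_reads_zero( ) is hypothetical.

```c
#define _DEFAULT_SOURCE          /* exposes MAP_ANONYMOUS under strict -std modes */
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 1 if a new anonymous page reads as all zeros and accepts a write,
 * 0 if a nonzero byte is seen, and -1 if the mapping cannot be created. */
int anon_page_reads_zero(void)
{
    long page_size = sysconf(_SC_PAGESIZE);
    if (page_size <= 0)
        return -1;
    size_t len = (size_t)page_size;

    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    int all_zero = 1;
    for (size_t i = 0; i < len; i++) {   /* read accesses come first */
        if (p[i] != 0) {
            all_zero = 0;
            break;
        }
    }

    p[0] = 0x5a;                         /* first write access to the page */
    int write_ok = (p[0] == 0x5a);

    munmap(p, len);
    return (all_zero && write_ok) ? 1 : 0;
}
```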

8.4.4 Copy On Write

First-generation Unix systems implemented process creation in a rather clumsy way: when a fork( ) system call was issued, the kernel duplicated the whole parent address space in the literal sense of the word and assigned the copy to the child process. This activity was quite time consuming since it required:

● Allocating page frames for the Page Tables of the child process
● Allocating page frames for the pages of the child process
● Initializing the Page Tables of the child process
● Copying the pages of the parent process into the corresponding pages of the child process

This way of creating an address space involved many memory accesses, used up many CPU cycles, and completely spoiled the cache contents. Last but not least, it was often pointless because many child processes start their execution by loading a new program, thus discarding entirely the inherited address space (see Chapter 20).

Modern Unix kernels, including Linux, follow a more efficient approach called Copy On Write (COW). The idea is quite simple: instead of duplicating page frames, they are shared between the parent and the child process. However, as long as they are shared, they cannot be modified. Whenever the parent or the child process attempts to write into a shared page frame, an exception occurs. At this point, the kernel duplicates the page into a new page frame that it marks as writable. The original page frame remains write-protected: when the other process tries to write into it, the kernel checks whether the writing process is the only owner of the page frame; in such a case, it makes the page frame writable for the process.

The count field of the page descriptor is used to keep track of the number of processes that are sharing the corresponding page frame. Whenever a process releases a page frame or a Copy On Write is executed on it, its count field is decremented; the page frame is freed only when count becomes 0.
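The semantics that Copy On Write must preserve can be observed from User Mode: after fork( ), a write made by the child is never seen by the parent. The sketch below checks only that guarantee; the lazy duplication of page frames happens inside the kernel and is not visible to the program. The names shared_value and fork_write_is_private( ) are hypothetical.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int shared_value = 1;   /* lives in a private, writable page */

/* Returns 1 if the child's write stayed private to the child,
 * 0 if it did not, and -1 if fork() or waitpid() failed. */
int fork_write_is_private(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        shared_value = 2;       /* write to a page inherited from the parent */
        _exit(shared_value);    /* report the child's view as its exit status */
    }

    int status = 0;
    if (waitpid(pid, &status, 0) != pid)
        return -1;

    int child_saw = WIFEXITED(status) ? WEXITSTATUS(status) : -1;
    return (shared_value == 1 && child_saw == 2) ? 1 : 0;
}
```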

Let's now describe how Linux implements COW. When handle_pte_fault( ) determines that the Page Fault exception was caused by an access to a page present in memory, it executes the following instructions:

    if (pte_present(entry)) {
        if (write_access) {
            if (!pte_write(entry))
                return do_wp_page(mm, vma, address, pte, entry);
            entry = pte_mkdirty(entry);
        }
        entry = pte_mkyoung(entry);
        set_pte(pte, entry);
        flush_tlb_page(vma, address);
        spin_unlock(&mm->page_table_lock);
        return 1;
    }

The handle_pte_fault( ) function is architecture-independent: it considers any possible violation of the page access rights. However, in the 80x86 architecture, if the page is present then the access was for writing and the page frame is write-protected (see Section 8.4.2). Thus, the do_wp_page( ) function is always invoked.

The do_wp_page( ) function starts by deriving the page descriptor of the page frame referenced by the Page Table entry involved in the Page Fault exception. Next, the function determines whether the page must really be duplicated. If only one process owns the page, Copy On Write does not apply and the process should be free to write the page. Basically, the function reads the count field of the page descriptor: if it is equal to 1, COW must not be done. Actually, the check is slightly more complicated, since the count field is also incremented when the page is inserted into the swap cache (see Section 16.3). However, when COW is not to be done, the page frame is marked as writable so that it does not cause further Page Fault exceptions when writes are attempted:

    set_pte(page_table, pte_mkyoung(pte_mkdirty(pte_mkwrite(pte))));
    flush_tlb_page(vma, address);
    spin_unlock(&mm->page_table_lock);
    return 1; /* minor fault */

If the page is shared among several processes by means of the COW, the function copies the content of the old page frame (old_page) into the newly allocated one (new_page). To avoid race conditions, the usage counter of old_page is incremented before starting the copy operation:

    old_page = pte_page(pte);
    atomic_inc(&old_page->count);
    spin_unlock(&mm->page_table_lock);
    new_page = alloc_page(GFP_HIGHUSER);
    vto = kmap_atomic(new_page, KM_USER0);
    if (old_page == ZERO_PAGE) {
        memset((void *)vto, 0, PAGE_SIZE);
    } else {
        vfrom = kmap_atomic(old_page, KM_USER1);
        memcpy((void *)vto, (void *)vfrom, PAGE_SIZE);
        kunmap_atomic(vfrom, KM_USER1);
    }
    kunmap_atomic(vto, KM_USER0);

If the old page is the zero page, the new frame is efficiently filled with zeros by using the memset macro. Otherwise, the page frame content is copied using the memcpy macro. Special handling for the zero page is not strictly required, but it improves the system performance because it preserves the microprocessor hardware cache by making fewer address references.

Since the allocation of a page frame can block the process, the function checks whether the Page Table entry has been modified since the beginning of the function (pte and *page_table do not have the same value). In this case, the new page frame is released, the usage counter of old_page is decremented (to undo the increment made previously), and the function terminates.

If everything looks OK, the physical address of the new page frame is finally written into the Page Table entry and the corresponding TLB register is invalidated:

    set_pte(pte, pte_mkwrite(pte_mkdirty(mk_pte(new_page, vma->vm_page_prot))));
    flush_tlb_page(vma, address);
    lru_cache_add(new_page);
    spin_unlock(&mm->page_table_lock);

The lru_cache_add( ) function inserts the new page frame in the swap-related data structures; see Chapter 16 for its description. Finally, do_wp_page( ) decrements the usage counter of old_page twice. The first decrement undoes the safety increment made before copying the page frame contents; the second decrement reflects the fact that the current process no longer owns the page frame.
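The decision made by do_wp_page( ) can be modeled in a few lines of ordinary C. The following is a toy model under stated assumptions, not the kernel implementation: toy_page, toy_pte, and toy_wp_fault( ) are hypothetical, and the sketch leaves out locking, the safety increment, the swap cache check, and the recheck of the Page Table entry.

```c
#include <stdlib.h>
#include <string.h>

#define TOY_PAGE_SIZE 64

struct toy_page {
    int count;                            /* number of owners of the frame */
    unsigned char data[TOY_PAGE_SIZE];
};

struct toy_pte {
    struct toy_page *page;                /* frame referenced by the entry */
    int writable;                         /* 0 while the frame is write-protected */
};

/* Handle a write fault on a present, write-protected toy entry.
 * Returns 0 on success, -1 if no memory is available for the copy. */
int toy_wp_fault(struct toy_pte *pte)
{
    struct toy_page *old_page = pte->page;

    if (old_page->count == 1) {           /* sole owner: no copy is needed */
        pte->writable = 1;
        return 0;
    }

    struct toy_page *new_page = malloc(sizeof(*new_page));
    if (new_page == NULL)
        return -1;
    memcpy(new_page->data, old_page->data, TOY_PAGE_SIZE);
    new_page->count = 1;

    old_page->count--;                    /* the writer gives up the old frame */
    pte->page = new_page;
    pte->writable = 1;
    return 0;
}
```

With two entries sharing one frame, the first writer gets a private copy and the second writer, now the only owner, simply has its entry made writable.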

8.4.5 Handling Noncontiguous Memory Area Accesses

We have seen in Section 7.3 that the kernel is quite lazy in updating the Page Table entries corresponding to noncontiguous memory areas. In fact, the vmalloc( ) and vfree( ) functions limit themselves to updating the master kernel Page Tables (i.e., the Page Global Directory init_mm.pgd and its child Page Tables).

However, once the kernel initialization phase ends, the master kernel Page Tables are not directly used by any process or kernel thread. Thus, consider the first time that a process in Kernel Mode accesses a noncontiguous memory area. When translating the linear address into a physical address, the CPU's memory management unit encounters a null Page Table entry and raises a Page Fault. However, the handler recognizes this special case because the exception occurred in Kernel Mode and the faulty linear address is greater than TASK_SIZE. Thus, the handler checks the corresponding master kernel Page Table entry:

    vmalloc_fault:
        asm("movl %%cr3,%0":"=r" (pgd));
        pgd = __pgd_offset(address) + (pgd_t *) __va(pgd);
        pgd_k = init_mm.pgd + __pgd_offset(address);
        if (!pgd_present(*pgd_k))
            goto no_context;
        set_pgd(pgd, *pgd_k);
        pmd = pmd_offset(pgd, address);
        pmd_k = pmd_offset(pgd_k, address);
        if (!pmd_present(*pmd_k))
            goto no_context;
        set_pmd(pmd, *pmd_k);
        pte_k = pte_offset(pmd_k, address);
        if (!pte_present(*pte_k))
            goto no_context;
        return;

The pgd local variable is loaded with the Page Global Directory address of the current process, which is stored in the cr3 register, [9] while the pgd_k local variable is loaded with the master kernel Page Global Directory. If the entry corresponding to the faulty linear address is null, the function jumps to the code at the no_context label (see Section 8.4.1 earlier in this chapter). Otherwise, the entry is copied into the corresponding entry of the process Page Global Directory. Then the whole operation is repeated with the master Page Middle Directory entry and, subsequently, with the master Page Table entry.

[9] The kernel doesn't use current->mm->pgd to derive the address because this fault can occur at any instant, even during a process switch.
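The lazy synchronization with the master kernel Page Tables can be pictured with a one-level toy model. This is only a sketch under simplifying assumptions: a single directory of 1,024 entries indexed by the top 10 bits of a 32-bit linear address, a hypothetical TOY_PRESENT bit, and a hypothetical toy_vmalloc_fault( ) function. The real handler repeats the check at each paging level.

```c
#include <stdint.h>

#define TOY_PGD_ENTRIES 1024
#define TOY_PRESENT     0x1UL   /* hypothetical "present" bit of an entry */

/* Copy the master directory entry covering address into the process
 * directory. Returns 0 if the entry was copied, -1 if the master entry
 * is not present either (a genuinely bad kernel access). */
int toy_vmalloc_fault(unsigned long *process_pgd,
                      const unsigned long *master_pgd,
                      uint32_t address)
{
    uint32_t index = address >> 22;       /* top 10 bits select the entry */

    if (!(master_pgd[index] & TOY_PRESENT))
        return -1;

    process_pgd[index] = master_pgd[index];
    return 0;
}
```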


8.5 Creating and Deleting a Process Address Space

Of the six typical cases mentioned earlier in Section 8.1, in which a process gets new memory regions, the first one (issuing a fork( ) system call) requires the creation of a whole new address space for the child process. Conversely, when a process terminates, the kernel destroys its address space. In this section, we discuss how these two activities are performed by Linux.

8.5.1 Creating a Process Address Space

In Section 3.4.1, we mentioned that the kernel invokes the copy_mm( ) function while creating a new process. This function creates the process address space by setting up all Page Tables and memory descriptors of the new process.

Each process usually has its own address space, but lightweight processes can be created by calling clone( ) with the CLONE_VM flag set. These processes share the same address space; that is, they are allowed to address the same set of pages.

Following the COW approach described earlier, traditional processes inherit the address space of their parent: pages stay shared as long as they are only read. When one of the processes attempts to write one of them, however, the page is duplicated; after some time, a forked process usually gets its own address space that is different from that of the parent process. Lightweight processes, on the other hand, use the address space of their parent process. Linux implements them simply by not duplicating the address space. Lightweight processes can be created considerably faster than normal processes, and the sharing of pages can also be considered a benefit so long as the parent and children coordinate their accesses carefully.

If the new process has been created by means of the clone( ) system call and if the CLONE_VM flag of the clone_flags parameter is set, copy_mm( ) gives the clone (tsk) the address space of its parent (current):

    if (clone_flags & CLONE_VM) {
        atomic_inc(&current->mm->mm_users);
        tsk->mm = current->mm;
        tsk->active_mm = current->mm;
        return 0;
    }

If the CLONE_VM flag is not set, copy_mm( ) must create a new address space (even though no memory is allocated within that address space until the process requests an address). The function allocates a new memory descriptor, stores its address in the mm field of the new process descriptor tsk, and then initializes its fields:

    tsk->mm = kmem_cache_alloc(mm_cachep, SLAB_KERNEL);
    tsk->active_mm = tsk->mm;
    memcpy(tsk->mm, current->mm, sizeof(*tsk->mm));
    atomic_set(&tsk->mm->mm_users, 1);
    atomic_set(&tsk->mm->mm_count, 1);
    init_rwsem(&tsk->mm->mmap_sem);
    tsk->mm->page_table_lock = SPIN_LOCK_UNLOCKED;
    tsk->mm->pgd = pgd_alloc(tsk->mm);

Remember that the pgd_alloc( ) macro allocates a Page Global Directory for the new process. The dup_mmap( ) function is then invoked to duplicate both the memory regions and the Page Tables of the parent process:

    down_write(&current->mm->mmap_sem);
    dup_mmap(tsk->mm);
    up_write(&current->mm->mmap_sem);
    copy_segments(tsk, tsk->mm);

The dup_mmap( ) function inserts the new memory descriptor tsk->mm in the global list of memory descriptors. Then it scans the list of regions owned by the parent process, starting from the one pointed to by current->mm->mmap. It duplicates each vm_area_struct memory region descriptor encountered and inserts the copy in the list of regions owned by the child process.

Right after inserting a new memory region descriptor, dup_mmap( ) invokes copy_page_range( ) to create, if necessary, the Page Tables needed to map the group of pages included in the memory region and to initialize the new Page Table entries. In particular, any page frame corresponding to a private, writable page (VM_SHARED flag off and VM_MAYWRITE flag on) is marked as read-only for both the parent and the child, so that it will be handled with the Copy On Write mechanism. Before terminating, dup_mmap( ) also creates the red-black tree of memory regions of the child process by invoking the build_mmap_rb( ) function.

Finally, copy_mm( ) invokes copy_segments( ), which initializes the architecture-dependent portion of the child's memory descriptor. Essentially, if the parent has a custom LDT, a copy of it is also assigned to the child.
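The two paths taken by copy_mm( ) differ mainly in whether the memory descriptor is shared or duplicated. The toy model below illustrates only that choice; toy_mm, TOY_CLONE_VM, and toy_copy_mm( ) are hypothetical names, and the sketch does not duplicate memory regions or Page Tables.

```c
#include <stdlib.h>

#define TOY_CLONE_VM 0x100UL     /* hypothetical flag value for this sketch */

struct toy_mm {
    int mm_users;                /* processes sharing this address space */
    unsigned long brk;           /* stand-in for the remaining fields */
};

/* Returns the descriptor the new process should use, or NULL if a new
 * descriptor is needed but cannot be allocated. */
struct toy_mm *toy_copy_mm(unsigned long clone_flags, struct toy_mm *parent)
{
    if (clone_flags & TOY_CLONE_VM) {
        parent->mm_users++;      /* lightweight process: share the descriptor */
        return parent;
    }

    struct toy_mm *child = malloc(sizeof(*child));
    if (child == NULL)
        return NULL;
    *child = *parent;            /* start from a copy of the parent's fields */
    child->mm_users = 1;         /* the new descriptor has a single user */
    return child;
}
```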

8.5.2 Deleting a Process Address Space

When a process terminates, the kernel invokes the exit_mm( ) function to release the address space owned by that process:

    mm_release();
    if (tsk->mm) {
        atomic_inc(&tsk->mm->mm_count);
        mm = tsk->mm;
        tsk->mm = NULL;
        enter_lazy_tlb(mm, current, smp_processor_id());
        mmput(mm);
    }

The mm_release( ) function wakes up any process sleeping in the tsk->vfork_done completion (see Section 5.3.8). Typically, the corresponding wait queue is nonempty only if the exiting process was created by means of the vfork( ) system call (see Section 3.4.1).

The processor is also put in lazy TLB mode (see Chapter 2).


8.6 Managing the Heap

Each Unix process owns a specific memory region called heap, which is used to satisfy the process's dynamic memory requests. The start_brk and brk fields of the memory descriptor delimit the starting and ending addresses, respectively, of that region.

The following C library functions can be used by the process to request and release dynamic memory:

malloc(size)
    Requests size bytes of dynamic memory; if the allocation succeeds, it returns the linear address of the first memory location.

calloc(n,size)
    Requests an array consisting of n elements of size size; if the allocation succeeds, it initializes the array components to 0 and returns the linear address of the first element.

free(addr)
    Releases the memory region allocated by malloc( ) or calloc( ) that has an initial address of addr.

brk(addr)
    Modifies the size of the heap directly; the addr parameter specifies the new value of current->mm->brk, and the return value is the new ending address of the memory region (the process must check whether it coincides with the requested addr value).

sbrk(incr)
    Is similar to brk( ), except that the incr parameter specifies the increment or decrement of the heap size in bytes.

The brk( ) function differs from the other functions listed because it is the only one implemented as a system call. All the other functions are implemented in the C library by using brk( ) and mmap( ).
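The effect of sbrk( ) on the end of the heap can be observed directly from a User Mode program. The sketch below assumes a Linux system with the GNU C library and a single-threaded caller that does not allocate memory between the calls; grow_heap( ) is a hypothetical helper name.

```c
#define _DEFAULT_SOURCE          /* exposes sbrk() under strict -std modes */
#include <stdint.h>
#include <unistd.h>

/* Move the end of the heap by incr bytes and store in *moved how far it
 * actually moved. Returns 0 on success, -1 if any sbrk() call fails. */
int grow_heap(intptr_t incr, long *moved)
{
    void *old_end = sbrk(0);             /* sbrk(0) reports the current end */
    if (old_end == (void *)-1)
        return -1;

    if (sbrk(incr) == (void *)-1)        /* request the change */
        return -1;

    void *new_end = sbrk(0);
    if (new_end == (void *)-1)
        return -1;

    *moved = (long)((char *)new_end - (char *)old_end);
    return 0;
}
```

Calling grow_heap(4096, &moved) and then grow_heap(-4096, &moved) would first enlarge the heap by one 4 KB page and then return it.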

When a process in User Mode invokes the brk( ) system call, the kernel executes the sys_brk(addr) function (see Chapter 9). This function first verifies whether the addr parameter falls inside the memory region that contains the process code; if so, it returns immediately:

    mm = current->mm;
    down_write(&mm->mmap_sem);
    if (addr < mm->end_code) {
    out:
        up_write(&mm->mmap_sem);
        return mm->brk;
    }

Since the brk( ) system call acts on a memory region, it allocates and deallocates whole pages. Therefore, the function aligns the value of addr to a multiple of PAGE_SIZE and compares the result with the value of the brk field of the memory descriptor:

    newbrk = (addr + 0xfff) & 0xfffff000;
    oldbrk = (mm->brk + 0xfff) & 0xfffff000;
    if (oldbrk == newbrk) {
        mm->brk = addr;
        goto out;
    }

If the process asked to shrink the heap, sys_brk( ) invokes the do_munmap( ) function to do the job and then returns:

    if (addr <= mm->brk) {
        if (!do_munmap(mm, newbrk, oldbrk-newbrk))
            mm->brk = addr;
        goto out;
    }

If the process asked to enlarge the heap, sys_brk( ) first checks whether the process is allowed to do so. If the process is trying to allocate memory outside its limit, the function simply returns the original value of mm->brk without allocating more memory:

    rlim = current->rlim[RLIMIT_DATA].rlim_cur;
    if (rlim < RLIM_INFINITY && addr - mm->start_data > rlim)
        goto out;

The function then checks whether the enlarged heap would overlap some other memory region belonging to the process and, if so, returns without doing anything:

    if (find_vma_intersection(mm, oldbrk, newbrk+PAGE_SIZE))
        goto out;

The last check before proceeding to the expansion consists of verifying whether the available free virtual memory is sufficient to support the enlarged heap (see Section 8.3.4 earlier in this chapter):

    if (!vm_enough_memory((newbrk-oldbrk) >> PAGE_SHIFT))
        goto out;

If everything is OK, the do_brk( ) function is invoked with the MAP_FIXED flag set. If it returns the oldbrk value, the allocation was successful and sys_brk( ) returns the value addr; otherwise, it returns the old mm->brk value:

    if (do_brk(oldbrk, newbrk-oldbrk) == oldbrk)
        mm->brk = addr;
    goto out;

The do_brk( ) function is actually a simplified version of do_mmap( ) that handles only anonymous memory regions. Its invocation might be considered equivalent to:

    do_mmap(NULL, oldbrk, newbrk-oldbrk, PROT_READ|PROT_WRITE|PROT_EXEC,
            MAP_FIXED|MAP_PRIVATE, 0)

Of course, do_brk( ) is slightly faster than do_mmap( ) because it avoids several checks on the memory region object fields by assuming that the memory region doesn't map a file on disk.
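The page rounding performed by sys_brk( ) and the resulting choice between shrinking, enlarging, or merely recording the new heap end can be checked with ordinary arithmetic. The following sketch assumes 4 KB pages and 32-bit addresses, as in the excerpts above; toy_page_align( ), toy_classify_brk( ), and the brk_action values are hypothetical names, and the resource limit, overlap, and free memory checks are left out.

```c
#include <stdint.h>

#define TOY_PAGE_MASK 0xfffff000u

enum brk_action { BRK_NO_PAGE_CHANGE, BRK_SHRINK, BRK_GROW };

/* Round a 32-bit address up to the next multiple of the 4 KB page size. */
uint32_t toy_page_align(uint32_t addr)
{
    return (addr + 0xfffu) & TOY_PAGE_MASK;
}

/* Classify a request to move the heap end from cur_brk to addr. */
enum brk_action toy_classify_brk(uint32_t cur_brk, uint32_t addr)
{
    uint32_t newbrk = toy_page_align(addr);
    uint32_t oldbrk = toy_page_align(cur_brk);

    if (oldbrk == newbrk)
        return BRK_NO_PAGE_CHANGE;   /* same last page: only the brk value changes */
    return (addr <= cur_brk) ? BRK_SHRINK : BRK_GROW;
}
```

For example, moving the heap end from 0x08051234 to 0x08051ff0 stays within the same page, so no page is mapped or unmapped.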


Chapter 9. System Calls

Operating systems offer processes running in User Mode a set of interfaces to interact with hardware devices such as the CPU, disks, and printers. Putting an extra layer between the application and the hardware has several advantages. First, it makes programming easier by freeing users from studying low-level programming characteristics of hardware devices. Second, it greatly increases system security, since the kernel can check the accuracy of the request at the interface level before attempting to satisfy it. Last but not least, these interfaces make programs more portable since they can be compiled and executed correctly on any kernel that offers the same set of interfaces.

Unix systems implement most interfaces between User Mode processes and hardware devices by means of system calls issued to the kernel. This chapter examines in detail how Linux implements system calls that User Mode programs issue to the kernel.


9.1 POSIX APIs and System Calls Le t 's s t a rt b y s t re s s in g t h e d iffe re n ce b e t we e n a n a p p lica t io n p ro g ra m m e r in t e rfa ce ( API) a n d a s ys t e m ca ll. Th e fo rm e r is a fu n ct io n d e fin it io n t h a t s p e cifie s h o w t o o b t a in a g ive n s e rvice , wh ile t h e la t t e r is a n e xp licit re q u e s t t o t h e ke rn e l m a d e via a s o ft wa re in t e rru p t . Un ix s ys t e m s in clu d e s e ve ra l lib ra rie s o f fu n ct io n s t h a t p ro vid e APIs t o p ro g ra m m e rs . S o m e o f t h e APIs d e fin e d b y t h e lib c s t a n d a rd C lib ra ry re fe r t o w ra p p e r ro u t in e s ( ro u t in e s wh o s e o n ly p u rp o s e is t o is s u e a s ys t e m ca ll) . Us u a lly, e a ch s ys t e m ca ll h a s a co rre s p o n d in g wra p p e r ro u t in e , wh ich d e fin e s t h e API t h a t a p p lica t io n p ro g ra m s s h o u ld e m p lo y. Th e co n ve rs e is n o t t ru e , b y t h e wa y—a n API d o e s n o t n e ce s s a rily co rre s p o n d t o a s p e cific s ys t e m ca ll. Firs t o f a ll, t h e API co u ld o ffe r it s s e rvice s d ire ct ly in Us e r Mo d e . ( Fo r s o m e t h in g a b s t ra ct like m a t h fu n ct io n s , t h e re m a y b e n o re a s o n t o m a ke s ys t e m ca lls . ) S e co n d , a s in g le API fu n ct io n co u ld m a ke s e ve ra l s ys t e m ca lls . Mo re o ve r, s e ve ra l API fu n ct io n s co u ld m a ke t h e s a m e s ys t e m ca ll, b u t wra p e xt ra fu n ct io n a lit y a ro u n d it . Fo r in s t a n ce , in Lin u x, t h e malloc( ), calloc( ), a n d free( ) APIs a re im p le m e n t e d in t h e lib c lib ra ry. Th e co d e in t h is lib ra ry ke e p s t ra ck o f t h e a llo ca t io n a n d d e a llo ca t io n re q u e s t s a n d u s e s t h e brk(

) s ys t e m ca ll t o e n la rg e o r s h rin k t h e p ro ce s s h e a p ( s e e S e ct io n 8 . 6 ) . Th e POS IX s t a n d a rd re fe rs t o APIs a n d n o t t o s ys t e m ca lls . A s ys t e m ca n b e ce rt ifie d a s POS IX- co m p lia n t if it o ffe rs t h e p ro p e r s e t o f APIs t o t h e a p p lica t io n p ro g ra m s , n o m a t t e r h o w t h e co rre s p o n d in g fu n ct io n s a re im p le m e n t e d . As a m a t t e r o f fa ct , s e ve ra l n o n - Un ix s ys t e m s h a ve b e e n ce rt ifie d a s POS IX- co m p lia n t , s in ce t h e y o ffe r a ll t ra d it io n a l Un ix s e rvice s in Us e r Mo d e lib ra rie s . Fro m t h e p ro g ra m m e r's p o in t o f vie w, t h e d is t in ct io n b e t we e n a n API a n d a s ys t e m ca ll is irre le va n t — t h e o n ly t h in g s t h a t m a t t e r a re t h e fu n ct io n n a m e , t h e p a ra m e t e r t yp e s , a n d t h e m e a n in g o f t h e re t u rn co d e . Fro m t h e ke rn e l d e s ig n e r's p o in t o f vie w, h o we ve r, t h e d is t in ct io n d o e s m a t t e r s in ce s ys t e m ca lls b e lo n g t o t h e ke rn e l, wh ile Us e r Mo d e lib ra rie s d o n 't . Mo s t wra p p e r ro u t in e s re t u rn a n in t e g e r va lu e , wh o s e m e a n in g d e p e n d s o n t h e co rre s p o n d in g s ys t e m ca ll. A re t u rn va lu e o f - 1 u s u a lly in d ica t e s t h a t t h e ke rn e l wa s u n a b le t o s a t is fy t h e p ro ce s s re q u e s t . A fa ilu re in t h e s ys t e m ca ll h a n d le r m a y b e ca u s e d b y in va lid p a ra m e t e rs , a la ck o f a va ila b le re s o u rce s , h a rd wa re p ro b le m s , a n d s o o n . Th e s p e cific e rro r co d e is co n t a in e d in t h e errno va ria b le , wh ich is d e fin e d in t h e lib c lib ra ry.

Each error code is defined as a macro constant, which yields a corresponding positive integer value. The POSIX standard specifies the macro names of several error codes. In Linux, on 80x86 systems, these macros are defined in the header file include/asm-i386/errno.h. To allow portability of C programs among Unix systems, the include/asm-i386/errno.h header file is included, in turn, in the standard /usr/include/errno.h C library header file. Other systems have their own specialized subdirectories of header files.
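The wrapper convention described above (a return value of -1 plus a positive code in errno) can be observed from an ordinary program. The following is a minimal sketch, assuming a POSIX system; the function names beginning with sketch_ are illustrative and do not come from libc.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Calls the close( ) wrapper on a descriptor that cannot be open and
 * reports what the wrapper left in errno (0 if the call succeeded). */
int sketch_errno_after_bad_close(void)
{
    errno = 0;
    if (close(-1) == -1)   /* the wrapper signals failure with -1 */
        return errno;      /* positive error code set by the wrapper */
    return 0;
}

/* Usage example: an invalid descriptor is reported as EBADF. */
void sketch_errno_demo(void)
{
    assert(sketch_errno_after_bad_close() == EBADF);
}
```

The program never sees a negative error value; it compares errno against the macro name rather than a literal number, which is what keeps the code portable across systems.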


9.2 System Call Handler and Service Routines

When a User Mode process invokes a system call, the CPU switches to Kernel Mode and starts the execution of a kernel function. In Linux a system call must be invoked by executing the int $0x80 assembly language instruction, which raises the programmed exception that has vector 128 (see Section 4.4.1 and Section 4.2.4, both in Chapter 4).

Since the kernel implements many different system calls, the process must pass a parameter called the system call number to identify the required system call; the eax register is used for this purpose. As we shall see in Section 9.2.3 later in this chapter, additional parameters are usually passed when invoking a system call.

All system calls return an integer value. The conventions for these return values are different from those for wrapper routines. In the kernel, positive or 0 values denote a successful termination of the system call, while negative values denote an error condition. In the latter case, the value is the negation of the error code that must be returned to the application program in the errno variable. The errno variable is not set or used by the kernel. Instead, the wrapper routines handle the task of setting this variable after a return from a system call.
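The translation between the two return conventions can be sketched in a few lines of C. This is only an illustration of what a wrapper routine might do after the kernel returns; the function name and the SKETCH_MAX_ERRNO bound are assumptions made for the example, since the text states only that negative values denote errors.

```c
#include <assert.h>
#include <errno.h>

/* Assumed upper bound on error codes; values below -SKETCH_MAX_ERRNO are
 * treated as ordinary results. The bound is hypothetical. */
#define SKETCH_MAX_ERRNO 4095L

/* Converts a raw kernel return value into the wrapper convention:
 * a negated error code becomes -1 with errno set; anything else is
 * passed through unchanged. */
long sketch_wrapper_return(long kernel_ret)
{
    if (kernel_ret < 0 && kernel_ret >= -SKETCH_MAX_ERRNO) {
        errno = (int)-kernel_ret;   /* errno holds the positive code */
        return -1;
    }
    return kernel_ret;
}

/* Usage example. */
void sketch_wrapper_demo(void)
{
    errno = 0;
    assert(sketch_wrapper_return(-EBADF) == -1);
    assert(errno == EBADF);
    assert(sketch_wrapper_return(3) == 3);
}
```

Keeping this step in the library rather than in the kernel is what lets the kernel ignore errno entirely.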
The system call handler, which has a structure similar to that of the other exception handlers, performs the following operations:

● Saves the contents of most registers in the Kernel Mode stack (this operation is common to all system calls and is coded in assembly language).
● Handles the system call by invoking a corresponding C function called the system call service routine.
● Exits from the handler by means of the ret_from_sys_call( ) function (which is coded in assembly language).

The name of the service routine associated with the xyz( ) system call is usually sys_xyz( ); there are, however, a few exceptions to this rule.

Figure 9-1 illustrates the relationships between the application program that invokes a system call, the corresponding wrapper routine, the system call handler, and the system call service routine. The arrows denote the execution flow between the functions.

Figure 9-1. Invoking a system call

To associate each system call number with its corresponding service routine, the kernel uses a system call dispatch table, which is stored in the sys_call_table array and has NR_syscalls entries (usually 256). The nth entry contains the service routine address of the system call having number n.

The NR_syscalls macro is just a static limit on the maximum number of implementable system calls; it does not indicate the number of system calls actually implemented. Indeed, any entry of the dispatch table may contain the address of the sys_ni_syscall( ) function, which is the service routine of the "nonimplemented" system calls; it just returns the error code -ENOSYS.
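A dispatch table of this kind is simply an array of function pointers in which unused slots point to a common "not implemented" routine. The sketch below mimics that arrangement in user space. All names carry a sketch_ prefix to mark them as illustrative, the three-argument signature is an assumption, and sketch_sys_add is a hypothetical service routine invented for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define SKETCH_NR_SYSCALLS 256   /* static upper bound on table entries */

typedef long (*sketch_service_t)(long a, long b, long c);

/* Service routine shared by every nonimplemented slot. */
static long sketch_sys_ni_syscall(long a, long b, long c)
{
    (void)a; (void)b; (void)c;
    return -ENOSYS;
}

/* Hypothetical service routine used only to populate one slot. */
static long sketch_sys_add(long a, long b, long c)
{
    (void)c;
    return a + b;
}

static sketch_service_t sketch_call_table[SKETCH_NR_SYSCALLS];

/* Fills every slot with the "not implemented" routine, then installs
 * the one real routine at an arbitrarily chosen number. */
void sketch_init_table(void)
{
    for (size_t n = 0; n < SKETCH_NR_SYSCALLS; n++)
        sketch_call_table[n] = sketch_sys_ni_syscall;
    sketch_call_table[1] = sketch_sys_add;
}

/* Looks up the nth entry and calls it; out-of-range numbers are
 * rejected with -ENOSYS before the table is touched. */
long sketch_dispatch(unsigned long nr, long a, long b, long c)
{
    if (nr >= SKETCH_NR_SYSCALLS)
        return -ENOSYS;
    return sketch_call_table[nr](a, b, c);
}

/* Usage example. */
void sketch_dispatch_demo(void)
{
    sketch_init_table();
    assert(sketch_dispatch(1, 2, 3, 0) == 5);          /* implemented slot */
    assert(sketch_dispatch(2, 0, 0, 0) == -ENOSYS);    /* empty slot */
    assert(sketch_dispatch(1000, 0, 0, 0) == -ENOSYS); /* out of range */
}
```

Because every slot holds a valid pointer, the lookup needs no special case for unimplemented numbers inside the table's range.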

9.2.1 Initializing System Calls

The trap_init( ) function, invoked during kernel initialization, sets up the Interrupt Descriptor Table (IDT) entry corresponding to vector 128 (i.e., 0x80) as follows:

    set_system_gate(0x80, &system_call);

The call loads the following values into the gate descriptor fields (see Section 4.4.1):

Segment Selector
    The __KERNEL_CS Segment Selector of the kernel code segment.

Offset
    The pointer to the system_call( ) exception handler.

Type
    Set to 15. Indicates that the exception is a Trap and that the corresponding handler does not disable maskable interrupts.

DPL (Descriptor Privilege Level)
    Set to 3. This allows processes in User Mode to invoke the exception handler (see Section 4.2.4).
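To make the four fields concrete, the sketch below packs them into the two 32-bit words of an 80x86 gate descriptor. The bit positions reflect the usual descriptor layout (present bit, DPL at bits 13-14 and Type at bits 8-11 of the high word) as an assumption of this example rather than something stated in this section. The selector value 0x10 and the handler address are made-up inputs, and the sketch_ names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

struct sketch_gate {
    uint32_t low;    /* selector in bits 16-31, offset bits 0-15 */
    uint32_t high;   /* offset bits 16-31, present bit, DPL, type */
};

/* Packs selector, handler offset, type, and DPL into a gate descriptor. */
struct sketch_gate sketch_pack_gate(uint16_t selector, uint32_t offset,
                                    unsigned type, unsigned dpl)
{
    struct sketch_gate g;
    g.low  = ((uint32_t)selector << 16) | (offset & 0xffffu);
    g.high = (offset & 0xffff0000u)
           | 0x8000u                          /* present bit */
           | ((uint32_t)(dpl & 3u) << 13)     /* Descriptor Privilege Level */
           | ((uint32_t)(type & 0xfu) << 8);  /* gate type */
    return g;
}

/* Usage example with illustrative selector and handler address. */
void sketch_gate_demo(void)
{
    struct sketch_gate g = sketch_pack_gate(0x10, 0xc0109abcu, 15, 3);
    assert(g.low == 0x00109abcu);
    assert(g.high == 0xc010ef00u);
}
```

With Type 15 and DPL 3 the attribute bits of the high word come out as 0xef00, which is the combination that lets User Mode code enter the gate without masking interrupts.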

9.2.2 The system_call( ) Function

The system_call( ) function implements the system call handler. It starts by saving the system call number and all the CPU registers that may be used by the exception handler on the stack, except for eflags, cs, eip, ss, and esp, which have already been saved automatically by the control unit (see Section 4.2.4). The SAVE_ALL macro, which was already discussed in Section 4.6.1.4, also loads the Segment Selector of the kernel data segment in ds and es:

    system_call:
        pushl %eax
        SAVE_ALL
        movl %esp, %ebx
        andl $0xffffe000, %ebx

The function also stores the address of the process descriptor in ebx. This is done by taking the value of the kernel stack pointer and rounding it down to a multiple of 8 KB, which is what masking with 0xffffe000 achieves (see Section 3.2.2).

Next, the system_call( ) function checks whether the PT_TRACESYS flag included in the ptrace field of current is set, that is, whether the system call invocations of the executed program are being traced by a debugger. If this is the case, system_call( ) invokes the syscall_trace( ) function twice: once right before and once right after the execution of the system call service routine. This function stops current and thus allows the debugging process to collect information about it.

A validity check is then performed on the system call number passed by the User Mode process. If it is greater than or equal to NR_syscalls, the system call handler terminates:

        cmpl $(NR_syscalls), %eax
        jb nobadsys
        movl $(-ENOSYS), 24(%esp)
        jmp ret_from_sys_call
    nobadsys:

If the system call number is not valid, the function stores the -ENOSYS value in the stack location where the eax register has been saved (at offset 24 from the current stack top). It then jumps to ret_from_sys_call( ). In this way, when the process resumes its execution in User Mode, it will find a negative return code in eax.

Finally, the specific service routine associated with the system call number contained in eax is invoked:

    call *sys_call_table(0, %eax, 4)

Since each entry in the dispatch table is 4 bytes long, the kernel finds the address of the service routine to be invoked by multiplying the system call number by 4, adding the initial address of the sys_call_table dispatch table, and extracting a pointer to the service routine from that slot in the table.

When the service routine terminates, system_call( ) gets its return code from eax and stores it in the stack location where the User Mode value of the eax register is saved. It then jumps to ret_from_sys_call( ), which terminates the execution of the system call handler (see Section 4.8.3):

    movl %eax, 24(%esp)
    jmp ret_from_sys_call

When the process resumes its execution in User Mode, it finds the return code of the system call in eax.
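Two pieces of address arithmetic in system_call( ) can be checked with plain integer operations: the mask that turns the kernel stack pointer into the base of the 8 KB area holding the process descriptor, and the scaled index used by the indirect call. The sketch below reproduces both on 32-bit values. The function names and the sample addresses are illustrative assumptions, not kernel symbols.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_STACK_MASK 0xffffe000u   /* clears the low 13 bits (8 KB) */

/* Mirrors "andl $0xffffe000, %ebx": base of the 8 KB area containing esp. */
uint32_t sketch_descriptor_base(uint32_t esp)
{
    return esp & SKETCH_STACK_MASK;
}

/* Mirrors the operand sys_call_table(0, %eax, 4): base + nr * 4. */
uint32_t sketch_table_slot(uint32_t table_base, uint32_t nr)
{
    return table_base + nr * 4u;
}

/* Usage example with made-up addresses. */
void sketch_address_demo(void)
{
    assert(sketch_descriptor_base(0xc0123abcu) == 0xc0122000u);
    assert(sketch_table_slot(0xc0300000u, 2) == 0xc0300008u);
}
```

Any stack pointer inside the same 8 KB area maps to the same base, which is why the handler can recover the descriptor without a memory lookup.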

9.2.3 Parameter Passing

Like ordinary functions, system calls often require some input/output parameters, which may consist of actual values (i.e., numbers), addresses of variables in the address space of the User Mode process, or even addresses of data structures including pointers to User Mode functions (see Section 10.4).

Since the system_call( ) function is the common entry point for all system calls in Linux, each of them has at least one parameter: the system call number passed in the eax register. For instance, if an application program invokes the fork( ) wrapper routine, the eax register is set to 2 (i.e., __NR_fork) before executing the int $0x80 assembly language instruction. Because the register is set by the wrapper routines included in the libc library, programmers do not usually care about the system call number.

The fork( ) system call does not require other parameters. However, many system calls do require additional parameters, which must be explicitly passed by the application program. For instance, the mmap( ) system call may require up to six additional parameters (besides the system call number).

The parameters of ordinary C functions are passed by writing their values in the active program stack (either the User Mode stack or the Kernel Mode stack). Since system calls are a special kind of function that crosses over from user to kernel land, neither the User Mode nor the Kernel Mode stack can be used. Rather, system call parameters are written in the CPU registers before invoking the int $0x80 assembly language instruction. The kernel then copies the parameters stored in the CPU registers onto the Kernel Mode stack before invoking the system call service routine because the latter is an ordinary C function.

Why doesn't the kernel copy parameters directly from the User Mode stack to the Kernel Mode stack? First of all, working with two stacks at the same time is complex; second, the use of registers makes the structure of the system call handler similar to that of other exception handlers.

However, to pass parameters in registers, two conditions must be satisfied:



● The length of each parameter cannot exceed the length of a register (32 bits). [1]
● The number of parameters must not exceed six (including the system call number passed in eax), since the Intel Pentium has a very limited number of registers.

[1] We refer, as usual, to the 32-bit architecture of the 80x86 processors. The discussion in this section does not apply to 64-bit architectures.

The first condition is always true since, according to the POSIX standard, large parameters that cannot be stored in a 32-bit register must be passed by reference. A typical example is the settimeofday( ) system call, which must read a 64-bit structure.

However, system calls that have more than six parameters exist. In such cases, a single register is used to point to a memory area in the process address space that contains the parameter values. Of course, programmers do not have to care about this workaround. As with any C function call, parameters are automatically saved on the stack when the wrapper routine is invoked. This routine will find the appropriate way to pass the parameters to the kernel.

The six registers used to store system call parameters are, in increasing order, eax (for the system call number), ebx, ecx, edx, esi, and edi. As seen before, system_call( ) saves the values of these registers on the Kernel Mode stack by using the SAVE_ALL macro. Therefore, when the system call service routine goes to the stack, it finds the return address to system_call( ), followed by the parameter stored in ebx (the first parameter of the system call), the parameter stored in ecx, and so on (see Section 4.6.1.4). This stack configuration is exactly the same as in an ordinary function call, and therefore the service routine can easily refer to its parameters by using the usual C-language constructs.

Let's look at an example. The sys_write( ) service routine, which handles the write( ) system call, is declared as:

    int sys_write (unsigned int fd, const char * buf, unsigned int count)

The C compiler produces an assembly language function that expects to find the fd, buf, and count parameters on top of the stack, right below the return address, in the locations used to save the contents of the ebx, ecx, and edx registers, respectively.

In a few cases, even if the system call doesn't use any parameters, the corresponding service routine needs to know the contents of the CPU registers right before the system call was issued. For example, the do_fork( ) function that implements fork( ) needs to know the value of the registers in order to duplicate them in the child process thread field (see Section 3.3.2.1). In these cases, a single parameter of type pt_regs allows the service routine to access the values saved in the Kernel Mode stack by the SAVE_ALL macro (see Section 4.6.1.5):

    int sys_fork (struct pt_regs regs)

The return value of a service routine must be written into the eax register. This is automatically done by the C compiler when a return n; instruction is executed.
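The stack layout that makes this work can be modeled as a C structure whose first fields are the saved parameter registers. The sketch below is an approximation: the positions of ebx, ecx, edx, esi, edi, and eax follow the description in this chapter, while the placement of ebp and of the segment register slots is an assumption about what SAVE_ALL pushes. The structure name is illustrative and is not the kernel's own pt_regs definition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Approximate image of the Kernel Mode stack after "pushl %eax" and
 * SAVE_ALL, lowest address first. Fixed-width fields keep the offsets
 * the same on any host. */
struct sketch_saved_regs {
    int32_t ebx;       /* first system call parameter  */
    int32_t ecx;       /* second parameter             */
    int32_t edx;       /* third parameter              */
    int32_t esi;       /* fourth parameter             */
    int32_t edi;       /* fifth parameter              */
    int32_t ebp;       /* assumed position             */
    int32_t eax;       /* system call number, later the return code */
    int32_t xds;       /* assumed position             */
    int32_t xes;       /* assumed position             */
    int32_t orig_eax;  /* copy pushed by "pushl %eax"  */
    int32_t eip, xcs, eflags, esp, xss;  /* saved by the control unit */
};

/* Offset of the saved eax slot from the top of the stack. */
size_t sketch_offset_of_eax(void)
{
    return offsetof(struct sketch_saved_regs, eax);
}

/* Usage example: the slot written by "movl %eax, 24(%esp)". */
void sketch_layout_demo(void)
{
    assert(sketch_offset_of_eax() == 24);
    assert(offsetof(struct sketch_saved_regs, ebx) == 0);
}
```

Under these assumptions the saved eax lands at offset 24, matching the operand 24(%esp) used by the handler, and the first three fields line up with the fd, buf, and count parameters of sys_write( ).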

9.2.4 Verifying the Parameters

All system call parameters must be carefully checked before the kernel attempts to satisfy a user request. The type of check depends both on the system call and on the specific parameter. Let's go back to the write( ) system call introduced before: the fd parameter should be a file descriptor that describes a specific file, so sys_write( ) must check whether fd really is a file descriptor of a file previously opened and whether the process is allowed to write into it (see Section 1.5.6). If any of these conditions are not true, the handler must return a negative value, in this case the error code -EBADF.

One type of checking, however, is common to all system calls. Whenever a parameter specifies an address, the kernel must check whether it is inside the process address space. There are two possible ways to perform this check:

● Verify that the linear address belongs to the process address space and, if so, that the memory region including it has the proper access rights.
● Verify just that the linear address is lower than PAGE_OFFSET (i.e., that it doesn't fall within the interval of addresses reserved for the kernel).

Early Linux kernels performed the first type of checking. But it is quite time consuming since it must be executed for each address parameter included in a system call; furthermore, it is usually pointless because faulty programs are not very common.

Therefore, starting with Version 2.2, Linux employs the second type of checking. This is much more efficient because it does not require any scan of the process memory region descriptors. Obviously, this is a very coarse check: verifying that the linear address is smaller than PAGE_OFFSET is a necessary but not sufficient condition for its validity. But there's no risk in confining the kernel to this limited kind of check because other errors will be caught later.

The approach followed is thus to defer the real checking until the last possible moment, that is, until the Paging Unit translates the linear address into a physical one. We shall discuss in Section 9.2.6, later in this chapter, how the Page Fault exception handler succeeds in detecting those bad addresses issued in Kernel Mode that were passed as parameters by User Mode processes.

One might wonder at this point why the coarse check is performed at all. This type of checking is actually crucial to preserve both process address spaces and the kernel address space from illegal accesses. We saw in Chapter 2 that the RAM is mapped starting from PAGE_OFFSET. This means that kernel routines are able to address all pages present in memory. Thus, if the coarse check were not performed, a User Mode process might pass an address belonging to the kernel address space as a parameter and then be able to read or write any page present in memory without causing a Page Fault exception.

The check on addresses passed to system calls is performed by the verify_area( ) function, which acts on two parameters: addr and size. [2]

[2] A third parameter named type specifies whether the system call should read or write the referred memory locations. It is used only in systems that have buggy versions of the Intel 80486 microprocessor, in which writing in Kernel Mode to a write-protected page does not generate a Page Fault. We don't discuss this case further.

The function checks the address interval delimited by addr and addr + size - 1, and is essentially equivalent to the following C function:

    int verify_area(const void * addr, unsigned long size)
    {
        unsigned long a = (unsigned long) addr;
        if (a + size < a || a + size > current->addr_limit.seg)
            return -EFAULT;
        return 0;
    }

The function first verifies whether addr + size, the highest address to be checked, is larger than 2^32 - 1; since unsigned long integers and pointers are represented by the GNU C compiler (gcc) as 32-bit numbers, this is equivalent to checking for an overflow condition. The function also checks whether addr + size exceeds the value stored in the addr_limit.seg field of current. This field usually has the value PAGE_OFFSET for normal processes and the value 0xffffffff for kernel threads. The value of the addr_limit.seg field can be dynamically changed by the get_fs and set_fs macros; this allows the kernel to invoke system call service routines directly and to pass addresses in the kernel data segment to them.

The access_ok macro performs the same check as verify_area( ). The only difference is its return value: it yields 1 if the specified address interval is valid and 0 otherwise. The __addr_ok macro also returns 1 if the specified linear address is valid and 0 otherwise.

9.2.5 Accessing the Process Address Space

System call service routines often need to read or write data contained in the process's address space. Linux includes a set of macros that make this access easier. We'll describe two of them, called get_user( ) and put_user( ). The first can be used to read 1, 2, or 4 consecutive bytes from an address, while the second can be used to write data of those sizes into an address.

Each macro accepts two arguments, a value x to transfer and a variable ptr. The second variable also determines how many bytes to transfer. Thus, in get_user(x,ptr), the size of the variable pointed to by ptr causes the macro to expand into a __get_user_1( ), __get_user_2( ), or __get_user_4( ) assembly language function. Let's consider one of them, __get_user_2( ):

    __get_user_2:
        addl $1, %eax
        jc bad_get_user
        movl %esp, %edx
        andl $0xffffe000, %edx
        cmpl 12(%edx), %eax
        jae bad_get_user
    2:  movzwl -1(%eax), %edx
        xorl %eax, %eax
        ret
    bad_get_user:
        xorl %edx, %edx
        movl $-EFAULT, %eax
        ret

The eax register contains the address ptr of the first byte to be read. The first six instructions essentially perform the same checks as the verify_area( ) function: they ensure that the 2 bytes to be read have addresses less than 4 GB as well as less than the addr_limit.seg field of the current process. (This field is stored at offset 12 in the process descriptor, which appears in the first operand of the cmpl instruction.)

If the addresses are valid, the function executes the movzwl instruction to store the data to be read in the two least significant bytes of the edx register while setting the high-order bytes of edx to 0; then it sets a 0 return code in eax and terminates. If the addresses are not valid, the function clears edx, sets the -EFAULT value into eax, and terminates.
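The address checks made by __get_user_2( ) before it touches memory can be restated in C. The sketch below covers only the validation part (the addl/jc pair and the cmpl/jae pair) on 32-bit values; it performs no memory access. The function name is illustrative, and the limit is passed as a parameter instead of being read from the process descriptor.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Returns 0 if the two bytes at addr and addr + 1 lie below addr_limit,
 * and -EFAULT otherwise. */
int sketch_check_get_user_2(uint32_t addr, uint32_t addr_limit)
{
    uint32_t last = addr + 1u;   /* addl $1, %eax */

    if (last < addr)             /* jc: the addition wrapped past 4 GB */
        return -EFAULT;
    if (last >= addr_limit)      /* cmpl 12(%edx), %eax ; jae */
        return -EFAULT;
    return 0;
}

/* Usage example with an illustrative limit of 0xc0000000. */
void sketch_get_user_demo(void)
{
    assert(sketch_check_get_user_2(0xbffffffeu, 0xc0000000u) == 0);
    assert(sketch_check_get_user_2(0xbfffffffu, 0xc0000000u) == -EFAULT);
    assert(sketch_check_get_user_2(0xffffffffu, 0xc0000000u) == -EFAULT);
}
```

Adding 1 first and reading at -1(%eax) afterward lets a single comparison cover the last byte of the access, which is the one that can cross the limit.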

The put_user(x,ptr) macro is similar to the one discussed before, except it writes the value x into the process address space starting from address ptr. Depending on the size of x, it invokes either the __put_user_asm( ) macro (size of 1, 2, or 4 bytes) or the __put_user_u64( ) macro (size of 8 bytes). Both macros return the value 0 in the eax register if they succeed in writing the value, and -EFAULT otherwise.

Several other functions and macros are available to access the process address space in Kernel Mode; they are listed in Table 9-1. Notice that many of them also have a variant prefixed by two underscores (__). The ones without initial underscores take extra time to check the validity of the linear address interval requested, while the ones with the underscores bypass that check. Whenever the kernel must repeatedly access the same memory area in the process address space, it is more efficient to check the address once at the start and then access the process area without making any further checks.

Table 9-1. Functions and macros that access the process address space

Function                                  Action
get_user, __get_user                      Reads an integer value from user space (1, 2, or 4 bytes)
put_user, __put_user                      Writes an integer value to user space (1, 2, or 4 bytes)
copy_from_user, __copy_from_user          Copies a block of arbitrary size from user space
copy_to_user, __copy_to_user              Copies a block of arbitrary size to user space
strncpy_from_user, __strncpy_from_user    Copies a null-terminated string from user space
strlen_user, strnlen_user                 Returns the length of a null-terminated string in user space
clear_user, __clear_user                  Fills a memory area in user space with zeros
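The "check once, then use the unchecked variant" pattern behind the double-underscore names can be illustrated with a small simulation. In the sketch below an ordinary array stands in for the process address space and a constant stands in for addr_limit.seg. Every name is hypothetical and none of this is kernel code.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define SKETCH_ADDR_LIMIT 64u                /* stand-in for addr_limit.seg */
uint8_t sketch_user_mem[SKETCH_ADDR_LIMIT];  /* simulated process memory */

/* Coarse interval check: 1 if [addr, addr + size) is valid, 0 otherwise. */
int sketch_access_ok(uint32_t addr, uint32_t size)
{
    uint32_t end = addr + size;
    return !(end < addr || end > SKETCH_ADDR_LIMIT);
}

/* Unchecked read, in the spirit of the double-underscore variants. */
uint8_t sketch_read_nocheck(uint32_t addr)
{
    return sketch_user_mem[addr];
}

/* Checked read: validates the address on every call. */
int sketch_read_checked(uint32_t addr, uint8_t *out)
{
    if (!sketch_access_ok(addr, 1))
        return -EFAULT;
    *out = sketch_read_nocheck(addr);
    return 0;
}

/* Sums len bytes: one check up front, then only unchecked accesses. */
long sketch_sum_block(uint32_t addr, uint32_t len)
{
    long sum = 0;
    if (!sketch_access_ok(addr, len))
        return -EFAULT;
    for (uint32_t i = 0; i < len; i++)
        sum += sketch_read_nocheck(addr + i);
    return sum;
}

/* Usage example. */
void sketch_access_demo(void)
{
    uint8_t b = 0;
    sketch_user_mem[0] = 1; sketch_user_mem[1] = 2;
    sketch_user_mem[2] = 3; sketch_user_mem[3] = 4;
    assert(sketch_sum_block(0, 4) == 10);
    assert(sketch_sum_block(60, 8) == -EFAULT);
    assert(sketch_read_checked(64, &b) == -EFAULT);
}
```

The loop in sketch_sum_block( ) pays for one interval check regardless of len, whereas calling sketch_read_checked( ) in the loop would repeat the check for every byte.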

9.2.6 Dynamic Address Checking: The Fixup Code

As seen previously, verify_area( ), access_ok, and __addr_ok make only a coarse check on the validity of linear addresses passed as parameters of a system call. Since they do not ensure that these addresses are included in the process address space, a process could cause a Page Fault exception by passing a wrong address.

Before describing how the kernel detects this type of error, let's specify the four cases in which Page Fault exceptions may occur in Kernel Mode. These cases must be distinguished by the Page Fault handler, since the actions to be taken are quite different.

1. The kernel attempts to address a page belonging to the process address space, but either the corresponding page frame does not exist or the kernel tries to write a read-only page. In these cases, the handler must allocate and initialize a new page frame (see Section 8.4.3 and Section 8.4.4).
2. The kernel addresses a page belonging to its address space, but the corresponding Page Table entry has not yet been initialized (see Section 8.4.5). In this case, the kernel must properly set up some entries in the Page Tables of the current process.
3. Some kernel function includes a programming bug that causes the exception to be raised when that program is executed; alternatively, the exception might be caused by a transient hardware error. When this occurs, the handler must perform a kernel oops (see Section 8.4.1).
4. The case introduced in this chapter: a system call service routine attempts to read or write into a memory area whose address has been passed as a system call parameter, but that address does not belong to the process address space.

The Page Fault handler can easily recognize the first case by determining whether the faulty linear address is included in one of the memory regions owned by the process. It is also able to detect the second case by checking whether the Page Tables of the process include a proper non-null entry that maps the address. Let's now explain how the handler distinguishes the remaining two cases.

9.2.6.1 The exception tables

The key to determining the source of a Page Fault lies in the narrow range of calls that the kernel uses to access the process address space. Only the small group of functions and macros described in the previous section are used to access this address space; thus, if the exception is caused by an invalid parameter, the instruction that caused it must be included in one of the functions, or else be generated by expanding one of the macros. The number of the instructions that address user space is fairly small.

Therefore, it does not take much effort to put the address of each kernel instruction that accesses the process address space into a structure called the exception table. If we succeed in doing this, the rest is easy. When a Page Fault exception occurs in Kernel Mode, the do_page_fault( ) handler examines the exception table: if it includes the address of the instruction that triggered the exception, the error is caused by a bad system call parameter; otherwise, it is caused by a more serious bug.

Linux defines several exception tables. The main exception table is automatically generated by the C compiler when building the kernel program image. It is stored in the __ex_table section of the kernel code segment, and its starting and ending addresses are identified by two symbols produced by the C compiler: __start___ex_table and __stop___ex_table.

Moreover, each dynamically loaded module of the kernel (see Appendix B) includes its own local exception table. This table is automatically generated by the C compiler when building the module image, and it is loaded into memory when the module is inserted in the running kernel.

Each entry of an exception table is an exception_table_entry structure that has two fields:

insn
    The linear address of an instruction that accesses the process address space

fixup
    The address of the assembly language code to be invoked when a Page Fault exception triggered by the instruction located at insn occurs

The fixup code consists of a few assembly language instructions that solve the problem triggered by the exception. As we shall see later in this section, the fix usually consists of inserting a sequence of instructions that forces the service routine to return an error code to the User Mode process. Such instructions are usually defined in the same macro or function that accesses the process address space; sometimes they are placed by the C compiler into a separate section of the kernel code segment called .fixup.

The search_exception_table( ) function is used to search for a specified address in all exception tables: if the address is included in a table, the function returns the corresponding fixup address; otherwise, it returns 0. Thus the Page Fault handler do_page_fault( ) executes the following statements:

    if ((fixup = search_exception_table(regs->eip)) != 0) {
        regs->eip = fixup;
        return;
    }

The regs->eip field contains the value of the eip register saved on the Kernel Mode stack when the exception occurred. If the value in the register (the instruction pointer) is in an exception table, do_page_fault( ) replaces the saved value with the address returned by search_exception_table( ). Then the Page Fault handler terminates and the interrupted program resumes with execution of the fixup code.
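The lookup just described can be modeled in ordinary user-space C. The following is only a sketch under stated assumptions: the table contents, the addresses, and the names demo_table, search_table( ), and fixup_kernel_fault( ) are invented for illustration, and a plain linear scan stands in for whatever search strategy the kernel actually uses.

```c
#include <assert.h>
#include <stddef.h>

/* Mirrors the two fields described in the text. */
struct exception_table_entry {
    unsigned long insn;   /* address of an instruction that may fault */
    unsigned long fixup;  /* address of the corresponding fixup code  */
};

/* Hypothetical table; the addresses are made up for this example. */
static const struct exception_table_entry demo_table[] = {
    { 0xc0101000UL, 0xc0105000UL },
    { 0xc0102004UL, 0xc0105010UL },
    { 0xc0103008UL, 0xc0105020UL },
};

/* Return the fixup address registered for addr, or 0 if there is none. */
static unsigned long search_table(const struct exception_table_entry *table,
                                  size_t n, unsigned long addr)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (table[i].insn == addr)
            return table[i].fixup;
    return 0;
}

/*
 * Model of the check made by the Page Fault handler: if the saved
 * instruction pointer is listed in the table, redirect it to the fixup
 * code and report success (1); otherwise leave it alone and return 0.
 */
static int fixup_kernel_fault(unsigned long *saved_eip)
{
    unsigned long fixup;

    assert(saved_eip != NULL);
    fixup = search_table(demo_table,
                         sizeof(demo_table) / sizeof(demo_table[0]),
                         *saved_eip);
    if (fixup != 0) {
        *saved_eip = fixup;
        return 1;
    }
    return 0;
}
```

A return value of 0 from fixup_kernel_fault( ) corresponds to the "more serious bug" case, in which the real handler goes on to perform a kernel oops.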

9.2.6.2 Generating the exception tables and the fixup code

The GNU Assembler .section directive allows programmers to specify which section of the executable file contains the code that follows. As we shall see in Chapter 20, an executable file includes a code segment, which in turn may be subdivided into sections. Thus, the following assembly language instructions add an entry into an exception table; the "a" attribute specifies that the section must be loaded into memory together with the rest of the kernel image:

    .section __ex_table, "a"
        .long faulty_instruction_address, fixup_code_address
    .previous

The .previous directive forces the assembler to insert the code that follows into the section that was active when the last .section directive was encountered.

Let's consider again the __get_user_1( ), __get_user_2( ), and __get_user_4( ) functions mentioned before. The instructions that access the process address space are those labeled as 1, 2, and 3:

    __get_user_1:
        [...]
    1:  movzbl (%eax), %edx
        [...]
    __get_user_2:
        [...]
    2:  movzwl -1(%eax), %edx
        [...]
    __get_user_4:
        [...]
    3:  movl -3(%eax), %edx
        [...]
    bad_get_user:
        xorl %edx, %edx
        movl $-EFAULT, %eax
        ret
    .section __ex_table,"a"
        .long 1b, bad_get_user
        .long 2b, bad_get_user
        .long 3b, bad_get_user
    .previous

Each exception table entry consists of two labels. The first one is a numeric label with a b suffix to indicate that the label is "backward"; in other words, it appears in a previous line of the program. The fixup code is common to the three functions and is labeled as bad_get_user. If a Page Fault exception is generated by the instructions at label 1, 2, or 3, the fixup code is executed. It simply returns an -EFAULT error code to the process that issued the system call.

Other kernel functions that act in the User Mode address space use the fixup code technique. Consider, for instance, the strlen_user(string) macro. This macro returns either the length of a null-terminated string passed as a parameter in a system call or the value 0 on error. The macro essentially yields the following assembly language instructions:

        movl $0, %eax
        movl $0x7fffffff, %ecx
        movl %ecx, %ebp
        movl string, %edi
    0:  repne; scasb
        subl %ecx, %ebp
        movl %ebp, %eax
    1:
    .section .fixup,"ax"
    2:  movl $0, %eax
        jmp 1b
    .previous
    .section __ex_table,"a"
        .long 0b, 2b
    .previous

The ecx and ebp registers are initialized with the 0x7fffffff value, which represents the maximum allowed length for the string in the User Mode address space. The repne;scasb assembly language instructions iteratively scan the string pointed to by the edi register, looking for the value 0 (the end of string \0 character) in eax. Since scasb decrements the ecx register at each iteration, the eax register ultimately stores the total number of bytes scanned in the string (that is, the length of the string).

The fixup code of the macro is inserted into the .fixup section. The "ax" attributes specify that the section must be loaded into memory and that it contains executable code. If a Page Fault exception is generated by the instructions at label 0, the fixup code is executed; it simply loads the value 0 in eax (thus forcing the macro to return a 0 error code instead of the string length) and then jumps to the 1 label, which corresponds to the instruction following the macro.


9.3 Kernel Wrapper Routines

Although system calls are used mainly by User Mode processes, they can also be invoked by kernel threads, which cannot use library functions. To simplify the declarations of the corresponding wrapper routines, Linux defines a set of seven macros called _syscall0 through _syscall6.

In the name of each macro, the numbers 0 through 6 correspond to the number of parameters used by the system call (excluding the system call number). The macros are used to declare wrapper routines that are not already included in the libc standard library (for instance, because the Linux system call is not yet supported by the library); however, they cannot be used to define wrapper routines for system calls that have more than six parameters (excluding the system call number) or for system calls that yield nonstandard return values.

Each macro requires exactly 2 + 2 × n parameters, with n being the number of parameters of the system call. The first two parameters specify the return type and the name of the system call; each additional pair of parameters specifies the type and the name of the corresponding system call parameter. Thus, for instance, the wrapper routine of the fork( ) system call may be generated by:

    _syscall0(int,fork)

while the wrapper routine of the write( ) system call may be generated by:

    _syscall3(int,write,int,fd,const char *,buf,unsigned int,count)

In the latter case, the macro yields the following code:

    int write(int fd,const char * buf,unsigned int count)
    {
        long __res;
        asm("int $0x80"
            : "=a" (__res)
            : "0" (__NR_write), "b" ((long)fd),
              "c" ((long)buf), "d" ((long)count));
        if ((unsigned long)__res >= (unsigned long)-125) {
            errno = -__res;
            __res = -1;
        }
        return (int) __res;
    }

The __NR_write macro is derived from the second parameter of _syscall3; it expands into the system call number of write( ). When compiling the preceding function, the following assembly language code is produced:

    write:
        pushl %ebx              ; push ebx into stack
        movl 8(%esp), %ebx      ; put first parameter in ebx
        movl 12(%esp), %ecx     ; put second parameter in ecx
        movl 16(%esp), %edx     ; put third parameter in edx
        movl $4, %eax           ; put __NR_write in eax
        int $0x80               ; invoke system call
        cmpl $-126, %eax        ; check return code
        jbe .L1                 ; if no error, jump
        negl %eax               ; complement the value of eax
        movl %eax, errno        ; put result in errno
        movl $-1, %eax          ; set eax to -1
    .L1:
        popl %ebx               ; pop ebx from stack
        ret                     ; return to calling program

Notice how the parameters of the write( ) function are loaded into the CPU registers before the int $0x80 instruction is executed. The value returned in eax must be interpreted as an error code if it lies between -1 and -125 (the kernel assumes that the largest error code defined in include/asm-i386/errno.h is 125). If this is the case, the wrapper routine stores the value of -eax in errno and returns the value -1; otherwise, it returns the value of eax.
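The return-value convention can be isolated from the int $0x80 machinery and exercised on its own. The sketch below is an illustration, not the kernel's or libc's code: the helper name syscall_result( ) and the TOY_MAX_ERRNO constant are hypothetical, and the error number is written through a pointer rather than into the global errno so that the example has no side effects.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption taken from the text: the largest error code is 125. */
#define TOY_MAX_ERRNO 125L

/*
 * Convert the raw value left in eax by a system call into the wrapper's
 * convention. Values in the range -1..-125 are error codes: the positive
 * error number is stored in *err and -1 is returned. Any other value is
 * returned unchanged and *err is left untouched.
 */
static int syscall_result(long raw, int *err)
{
    assert(err != NULL);
    /* Same unsigned comparison used by the code the macro expands to. */
    if ((unsigned long)raw >= (unsigned long)-TOY_MAX_ERRNO) {
        *err = (int)-raw;
        return -1;
    }
    return (int)raw;
}
```

The unsigned comparison is what lets a single test cover the whole error range: every value from -125 to -1, reinterpreted as unsigned, is at least as large as (unsigned long)-125, while -126 and all non-negative results fall below it.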


Chapter 10. Signals

Signals were introduced by the first Unix systems to allow interactions between User Mode processes; the kernel also uses them to notify processes of system events. Signals have been around for 30 years with only minor changes.

The first sections of this chapter examine in detail how signals are handled by the Linux kernel, then we discuss the system calls that allow processes to exchange signals.


10.1 The Role of Signals

A signal is a very short message that may be sent to a process or a group of processes. The only information given to the process is usually a number identifying the signal; there is no room in standard signals for arguments, a message, or other accompanying information.

A set of macros whose names start with the prefix SIG is used to identify signals; we have already made a few references to them in previous chapters. For instance, the SIGCHLD macro was mentioned in Section 3.4.1. This macro, which expands into the value 17 in Linux, yields the identifier of the signal that is sent to a parent process when a child stops or terminates. The SIGSEGV macro, which expands into the value 11, was mentioned in Section 8.4; it yields the identifier of the signal that is sent to a process when it makes an invalid memory reference.

Signals serve two main purposes:

● To make a process aware that a specific event has occurred
● To force a process to execute a signal handler function included in its code

Of course, the two purposes are not mutually exclusive, since often a process must react to some event by executing a specific routine.

Table 10-1 lists the first 31 signals handled by Linux 2.4 for the 80 x 86 architecture (some signal numbers, such as those associated with SIGCHLD or SIGSTOP, are architecture-dependent; furthermore, some signals such as SIGSTKFLT are defined only for specific architectures). The meanings of the default actions are described in the next section.

Table 10-1. The first 31 signals in Linux/i386

#   Signal name  Default action  Comment                                   POSIX
1   SIGHUP       Terminate       Hang up controlling terminal or process   Yes
2   SIGINT       Terminate       Interrupt from keyboard                   Yes
3   SIGQUIT      Dump            Quit from keyboard                        Yes
4   SIGILL       Dump            Illegal instruction                       Yes
5   SIGTRAP      Dump            Breakpoint for debugging                  No
6   SIGABRT      Dump            Abnormal termination                      Yes
6   SIGIOT       Dump            Equivalent to SIGABRT                     No
7   SIGBUS       Dump            Bus error                                 No
8   SIGFPE       Dump            Floating-point exception                  Yes
9   SIGKILL      Terminate       Forced-process termination                Yes
10  SIGUSR1      Terminate       Available to processes                    Yes
11  SIGSEGV      Dump            Invalid memory reference                  Yes
12  SIGUSR2      Terminate       Available to processes                    Yes
13  SIGPIPE      Terminate       Write to pipe with no readers             Yes
14  SIGALRM      Terminate       Real-timer clock                          Yes
15  SIGTERM      Terminate       Process termination                       Yes
16  SIGSTKFLT    Terminate       Coprocessor stack error                   No
17  SIGCHLD      Ignore          Child process stopped or terminated       Yes
18  SIGCONT      Continue        Resume execution, if stopped              Yes
19  SIGSTOP      Stop            Stop process execution                    Yes
20  SIGTSTP      Stop            Stop process issued from tty              Yes
21  SIGTTIN      Stop            Background process requires input         Yes
22  SIGTTOU      Stop            Background process requires output        Yes
23  SIGURG       Ignore          Urgent condition on socket                No
24  SIGXCPU      Dump            CPU time limit exceeded                   No
25  SIGXFSZ      Dump            File size limit exceeded                  No
26  SIGVTALRM    Terminate       Virtual timer clock                       No
27  SIGPROF      Terminate       Profile timer clock                       No
28  SIGWINCH     Ignore          Window resizing                           No
29  SIGIO        Terminate       I/O now possible                          No
29  SIGPOLL      Terminate       Equivalent to SIGIO                       No
30  SIGPWR       Terminate       Power supply failure                      No
31  SIGSYS       Dump            Bad system call                           No
31  SIGUNUSED    Dump            Equivalent to SIGSYS                      No

Besides the regular signals described in this table, the POSIX standard has introduced a new class of signals denoted as real-time signals; their signal numbers range from 32 to 64 on Linux. They mainly differ from regular signals because they are always queued so that multiple signals sent will be received. On the other hand, regular signals of the same kind are not queued: if a regular signal is sent many times in a row, just one of them is delivered to the receiving process. Although the Linux kernel does not use real-time signals, it fully supports the POSIX standard by means of several specific system calls.

A number of system calls allow programmers to send signals and determine how their processes respond to the signals they receive. Table 10-2 summarizes these calls; their behavior is described in detail later in Section 10.4.

Table 10-2. The most significant system calls related to signals

System call          Description
kill( )              Send a signal to a process.
sigaction( )         Change the action associated with a signal.
signal( )            Similar to sigaction( ).
sigpending( )        Check whether there are pending signals.
sigprocmask( )       Modify the set of blocked signals.
sigsuspend( )        Wait for a signal.
rt_sigaction( )      Change the action associated with a real-time signal.
rt_sigpending( )     Check whether there are pending real-time signals.
rt_sigprocmask( )    Modify the set of blocked real-time signals.
rt_sigqueueinfo( )   Send a real-time signal to a process.
rt_sigsuspend( )     Wait for a real-time signal.
rt_sigtimedwait( )   Similar to rt_sigsuspend( ).

An important characteristic of signals is that they may be sent at any time to a process whose state is usually unpredictable. Signals sent to a process that is not currently executing must be saved by the kernel until that process resumes execution. Blocking a signal (described later) requires that delivery of the signal be held off until it is later unblocked, which exacerbates the problem of signals being raised before they can be delivered.

Therefore, the kernel distinguishes two different phases related to signal transmission:

Signal generation
    The kernel updates a data structure of the destination process to represent that a new signal has been sent.

Signal delivery
    The kernel forces the destination process to react to the signal by changing its execution state, by starting the execution of a specified signal handler, or both.

Each signal generated can be delivered once, at most. Signals are consumable resources: once they have been delivered, all process descriptor information that refers to their previous existence is canceled.

Signals that have been generated but not yet delivered are called pending signals. At any time, only one pending signal of a given type may exist for a process; additional pending signals of the same type to the same process are not queued but simply discarded. Real-time signals are different, though: there can be several pending signals of the same type.

In general, a signal may remain pending for an unpredictable amount of time. The following factors must be taken into consideration:

● Signals are usually delivered only to the currently running process (that is, to the current process).
● Signals of a given type may be selectively blocked by a process (see Section 10.4.4 later in this chapter). In this case, the process does not receive the signal until it removes the block.
● When a process executes a signal-handler function, it usually masks the corresponding signal, i.e., it automatically blocks the signal until the handler terminates. A signal handler therefore cannot be interrupted by another occurrence of the handled signal and the function doesn't need to be re-entrant.
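The generation and delivery phases, and the effect of blocking, can be modeled with two bit masks. The sketch below is a toy user-space model, not kernel code: the toy_proc structure and the toy_* functions are hypothetical names, only regular signals 1 through 31 are covered, and delivering the lowest-numbered ready signal first is an assumption made to keep the example deterministic.

```c
#include <assert.h>
#include <stdint.h>

/* Toy per-process state: one bit per regular signal (1..31). */
struct toy_proc {
    uint32_t pending;  /* bit (sig - 1) set: generated, not yet delivered */
    uint32_t blocked;  /* bit (sig - 1) set: delivery is held off         */
};

static uint32_t toy_sig_bit(int sig)
{
    assert(sig >= 1 && sig <= 31);
    return UINT32_C(1) << (sig - 1);
}

/*
 * Signal generation: record that the signal was sent. Generating a
 * signal that is already pending sets the same bit again, so the extra
 * instance is lost, as the text describes for regular signals.
 */
static void toy_generate(struct toy_proc *p, int sig)
{
    p->pending |= toy_sig_bit(sig);
}

/*
 * Signal delivery: pick a pending signal that is not blocked, clear its
 * pending bit (signals are consumable), and return its number.
 * Returns 0 when nothing can be delivered.
 */
static int toy_deliver(struct toy_proc *p)
{
    uint32_t ready = p->pending & ~p->blocked;
    int sig;

    for (sig = 1; sig <= 31; sig++) {
        if (ready & toy_sig_bit(sig)) {
            p->pending &= ~toy_sig_bit(sig);
            return sig;
        }
    }
    return 0;
}
```

In this model, a blocked signal simply stays in the pending mask until the corresponding bit in blocked is cleared, which is why it may remain pending for an unpredictable amount of time.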



Although the notion of signals is intuitive, the kernel implementation is rather complex. The kernel must:

● Remember which signals are blocked by each process.
● When switching from Kernel Mode to User Mode, check whether a signal for any process has arrived. This happens at almost every timer interrupt (roughly every 10 ms).
● Determine whether the signal can be ignored. This happens when all of the following conditions are fulfilled:
    ❍ The destination process is not traced by another process (the PT_PTRACED flag in the process descriptor ptrace field is equal to 0). [1]
    ❍ The signal is not blocked by the destination process.
    ❍ The signal is being ignored by the destination process (either because the process explicitly ignored it or because the process did not change the default action of the signal and that action is "ignore").
● Handle the signal, which may require switching the process to a handler function at any point during its execution and restoring the original execution context after the function returns.

[1] If a process receives a signal while it is being traced, the kernel stops the process and notifies the tracing process by sending a SIGCHLD signal to it. The tracing process may, in turn, resume execution of the traced process by means of a SIGCONT signal.
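The three conditions under which a signal can be ignored combine into a single predicate. The following sketch restates them in C; the toy_signal_state structure, its field names, and toy_can_ignore( ) are hypothetical and do not correspond to kernel data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* How the destination process has set up the signal. */
enum toy_action { TOY_DEFAULT, TOY_IGNORE, TOY_HANDLER };

/* Hypothetical summary of the state the kernel must consult. */
struct toy_signal_state {
    bool traced;             /* models the PT_PTRACED flag being set     */
    bool blocked;            /* signal is blocked by the destination     */
    enum toy_action action;  /* explicit ignore, handler, or default     */
    bool default_is_ignore;  /* default action is "ignore" (e.g. SIGCHLD) */
};

/* True only when all three conditions listed in the text hold. */
static bool toy_can_ignore(const struct toy_signal_state *s)
{
    bool ignored;

    assert(s != NULL);
    ignored = s->action == TOY_IGNORE ||
              (s->action == TOY_DEFAULT && s->default_is_ignore);
    return !s->traced && !s->blocked && ignored;
}
```

Note that a traced process never qualifies, even if it ignores the signal, because the tracing process must still be notified.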

Moreover, Linux must take into account the different semantics for signals adopted by BSD and System V; furthermore, it must comply with the rather cumbersome POSIX requirements.

10.1.1 Actions Performed upon Delivering a Signal

There are three ways in which a process can respond to a signal:

1. Explicitly ignore the signal.

2. Execute the default action associated with the signal (see Table 10-1). This action, which is predefined by the kernel, depends on the signal type and may be any one of the following:

   Terminate
       The process is terminated (killed).

   Dump
       The process is terminated (killed) and a core file containing its execution context is created, if possible; this file may be used for debug purposes.

   Ignore
       The signal is ignored.

   Stop
       The process is stopped, i.e., put in the TASK_STOPPED state (see Section 3.2.1).

   Continue
       If the process is stopped (TASK_STOPPED), it is put into the TASK_RUNNING state.

3. Catch the signal by invoking a corresponding signal-handler function.

Notice that blocking a signal is different from ignoring it. A signal is not delivered as long as it is blocked; it is delivered only after it has been unblocked. An ignored signal is always delivered, and there is no further action.

The SIGKILL and SIGSTOP signals cannot be ignored, caught, or blocked, and their default actions must always be executed. Therefore, SIGKILL and SIGSTOP allow a user with appropriate privileges to terminate and to stop, respectively, any process, [2] regardless of the defenses taken by the program it is executing.

[2] There are two exceptions: it is not possible to send a signal to process 0 (swapper), and signals sent to process 1 (init) are always discarded unless they are caught. Therefore, process 0 never dies, while process 1 dies only when the init program terminates.

10.1.2 Data Structures Associated with Signals

For any process in the system, the kernel must keep track of what signals are currently pending or masked, as well as how to handle every signal. To do this, it uses several data structures accessible from the process descriptor. The most significant ones are shown in Figure 10-1.

Figure 10-1. The most significant data structures related to signal handling

The fields of the process descriptor related to signal handling are listed in Table 10-3.

Table 10-3. Process descriptor fields related to signal handling

Type                     Name           Description
spinlock_t               sigmask_lock   Spin lock protecting pending and blocked
struct signal_struct *   sig            Pointer to the process's signal descriptor
sigset_t                 blocked        Mask of blocked signals
struct sigpending        pending        Data structure storing the pending signals
unsigned long            sas_ss_sp      Address of alternate signal handler stack
size_t                   sas_ss_size    Size of alternate signal handler stack
int (*) (void *)         notifier       Pointer to a function used by a device driver to block some signals of the process
void *                   notifier_data  Pointer to data that might be used by the notifier function (previous field of table)
sigset_t *               notifier_mask  Bit mask of signals blocked by a device driver through a notifier function

The blocked field stores the signals currently masked out by the process. It is a sigset_t array of bits, one for each signal type:

    typedef struct {
        unsigned long sig[2];
    } sigset_t;

Since each unsigned long number consists of 32 bits, the maximum number of signals that may be declared in Linux is 64 (the _NSIG macro specifies this value). No signal can have number 0, so the signal number corresponds to the index of the corresponding bit in a sigset_t variable plus one. Numbers between 1 and 31 correspond to the signals listed in Table 10-1, while numbers between 32 and 64 correspond to real-time signals.

The sig field of the process descriptor points to a signal descriptor, which describes how each signal must be handled by the process. The descriptor is stored in a signal_struct structure, which is defined as follows:

    struct signal_struct {
        atomic_t            count;
        struct k_sigaction  action[64];
        spinlock_t          siglock;
    };

As mentioned in Section 3.4.1, this structure may be shared by several processes by invoking the clone( ) system call with the CLONE_SIGHAND flag set. [3] The count field specifies the number of processes that share the signal_struct structure, while the siglock field is used to ensure exclusive access to its fields. The action field is an array of 64 k_sigaction structures that specify how each signal must be handled.

[3] If this is not done, about 1,300 bytes are added to the process data structures just to take care of signal handling.

Some architectures assign properties to a signal that are visible only to the kernel. Thus, the properties of a signal are stored in a k_sigaction structure, which contains both the properties hidden from the User Mode process and the more familiar sigaction structure that holds all the properties a User Mode process can see. Actually, on the 80 x 86 platform, all signal properties are visible to User Mode processes. Thus the k_sigaction structure simply reduces to a single sa structure of type sigaction, which includes the following fields:

sa_handler or sa_sigaction
    Both names refer to the same field of the structure, which specifies the type of action to be performed; its value can be either a pointer to the signal handler, SIG_DFL (that is, the value 0) to specify that the default action is performed, or SIG_IGN (that is, the value 1) to specify that the signal is ignored. The two different names of this field correspond to two different types of signal handler (see Section 10.4.2 later in this chapter).

sa_flags Th is s e t o f fla g s s p e cifie s h o w t h e s ig n a l m u s t b e h a n d le d ; s o m e o f t h e m a re lis t e d in Ta b le 1 0 - 4 .

sa_mask Th is sigset_t va ria b le s p e cifie s t h e s ig n a ls t o b e m a s ke d wh e n ru n n in g t h e s ig n a l h a n d le r.

Table 10-4. Flags specifying how to handle a signal

Flag Name                   Description
SA_NOCLDSTOP                Do not send SIGCHLD to the parent when the process is stopped.
SA_NODEFER, SA_NOMASK       Do not mask the signal while executing the signal handler.
SA_RESETHAND, SA_ONESHOT    Reset to default action after executing the signal handler.
SA_ONSTACK                  Use an alternate stack for the signal handler (see Section 10.3.3 later in this chapter).
SA_RESTART                  Interrupted system calls are automatically restarted (see Section 10.3.4 later in this chapter).
SA_SIGINFO                  Provide additional information to the signal handler (see Section 10.4.2 later in this chapter).

The pending field of the process descriptor is used to keep track of what signals are currently pending. It consists of a struct sigpending data structure, which is defined as follows:

    struct sigpending {
        struct sigqueue *head, **tail;
        sigset_t signal;
    };

The signal field is a bit mask specifying the pending signals for the process, while the head and tail fields point to the first and last items of a pending signal queue. This queue is implemented through a list of struct sigqueue data structures:

    struct sigqueue {
        struct sigqueue *next;
        siginfo_t info;
    };

The nr_queued_signals variable stores the number of items in the queue, while the max_queued_signals variable defines the maximum length of the queue (which is 1,024 by default, but the system administrator can change this value either by writing into the /proc/sys/kernel/rtsig-max file or by issuing a suitable sysctl( ) system call).
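The following user-space sketch may help to picture how a queue with a head pointer and a pointer-to-pointer tail behaves. It is only a model: the type names, the integer payload that stands in for siginfo_t, and the plain (non-atomic) counter are assumptions made for illustration, not the kernel's definitions.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct sigqueue; signo replaces siginfo_t. */
struct sigqueue_model {
    struct sigqueue_model *next;
    int signo;
};

/* Hypothetical stand-in for the head/tail part of struct sigpending. */
struct sigpending_model {
    struct sigqueue_model *head, **tail;
};

enum { MAX_QUEUED_MODEL = 1024 };  /* mirrors the default limit quoted in the text */
static int nr_queued_model;        /* plain counter; the kernel uses an atomic one */

static void pending_init(struct sigpending_model *p)
{
    p->head = NULL;
    p->tail = &p->head;  /* empty queue: tail addresses the head pointer itself */
}

/* Appends one item; returns 0 on success, -1 if the limit is hit or memory is short. */
static int pending_append(struct sigpending_model *p, int signo)
{
    struct sigqueue_model *q;

    assert(signo >= 1 && signo <= 64);
    if (nr_queued_model >= MAX_QUEUED_MODEL)
        return -1;
    q = malloc(sizeof(*q));
    if (q == NULL)
        return -1;
    nr_queued_model++;
    q->next = NULL;
    q->signo = signo;
    *p->tail = q;        /* link the item after the current last one (or into head) */
    p->tail = &q->next;  /* tail now addresses the next pointer of the new last item */
    return 0;
}
```

Because tail always holds the address of the pointer that must be overwritten next, appending takes constant time and needs no special case for the empty queue.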

The siginfo_t data structure is a 128-byte data structure that stores information about an occurrence of a specific signal; it includes the following fields:

si_signo
    The signal number.

si_errno
    The error code of the instruction that caused the signal to be raised, or 0 if there was no error.

si_code
    A code identifying who raised the signal (see Table 10-5).

Table 10-5. The most significant signal sender codes

Code Name     Sender
SI_USER       kill( ) and raise( ) (see Section 10.4 later in this chapter)
SI_KERNEL     Generic kernel function
SI_TIMER      Timer expiration
SI_ASYNCIO    Asynchronous I/O completion

_sifields
    A union storing information depending on the type of signal. For instance, the siginfo_t data structure relative to an occurrence of the SIGKILL signal records the PID and the UID of the sender process here; conversely, the data structure relative to an occurrence of the SIGSEGV signal stores the memory address whose access caused the signal to be raised.
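From User Mode, the contents of a siginfo_t structure can be observed by installing a handler with the SA_SIGINFO flag. The sketch below assumes a POSIX system; the helper name send_and_capture and the three capture variables are made up for this example. It sends SIGUSR1 to the calling process with kill( ), so the sender code should be SI_USER and the recorded PID should be that of the process itself.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Values copied out of siginfo_t by the handler (hypothetical names). */
static volatile sig_atomic_t seen_signo, seen_code, seen_pid;

/* Three-argument handler form, selected by SA_SIGINFO. */
static void on_usr1(int signo, siginfo_t *info, void *ucontext)
{
    (void)ucontext;
    seen_signo = signo;
    seen_code = info->si_code;
    seen_pid = (sig_atomic_t)info->si_pid;
}

/* Installs the handler and sends SIGUSR1 to the caller; returns 0 on success. */
static int send_and_capture(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = on_usr1;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return -1;
    /* An unblocked signal sent to oneself is delivered before kill( ) returns. */
    return kill(getpid(), SIGUSR1);
}
```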

10.1.3 Operations on Signal Data Structures

Several functions and macros are used by the kernel to handle signals. In the following description, set is a pointer to a sigset_t variable, nsig is the number of a signal, and mask is an unsigned long bit mask.

sigemptyset(set) and sigfillset(set)
    Sets the bits in the sigset_t variable to 0 or 1, respectively.

sigaddset(set,nsig) and sigdelset(set,nsig)
    Sets the bit of the sigset_t variable corresponding to signal nsig to 1 or 0, respectively. In practice, sigaddset( ) reduces to:

        set->sig[(nsig - 1) / 32] |= 1UL << ((nsig - 1) % 32);

    and sigdelset( ) to:

        set->sig[(nsig - 1) / 32] &= ~(1UL << ((nsig - 1) % 32));

sigaddsetmask(set,mask) and sigdelsetmask(set,mask)
    Sets all the bits of the sigset_t variable whose corresponding bits of mask are on to 1 or 0, respectively. They can be used only with signals that have a number between 1 and 32. The corresponding functions reduce to:

        set->sig[0] |= mask;

    and to:

        set->sig[0] &= ~mask;

sigismember(set,nsig)
    Returns the value of the bit of the sigset_t variable corresponding to the signal nsig. In practice, this function reduces to:

        return 1 & (set->sig[(nsig-1) / 32] >> ((nsig-1) % 32));

sigmask(nsig)
    Yields the bit index of the signal nsig. In other words, if the kernel needs to set, clear, or test a bit in an element of sigset_t that corresponds to a particular signal, it can derive the proper bit through this macro.

sigandsets(d,s1,s2), sigorsets(d,s1,s2), and signandsets(d,s1,s2)
    Performs a logical AND, a logical OR, and a logical NAND, respectively, between the sigset_t variables to which s1 and s2 point; the result is stored in the sigset_t variable to which d points. (The operation that the kernel calls NAND is the AND of s1 with the complement of s2.)

sigtestsetmask(set,mask)
    Returns the value 1 if any of the bits in the sigset_t variable that correspond to the bits set to 1 in mask is set; it returns 0 otherwise. It can be used only with signals that have a number between 1 and 32.

siginitset(set,mask)
    Initializes the low bits of the sigset_t variable corresponding to signals between 1 and 32 with the bits contained in mask, and clears the bits corresponding to signals between 33 and 64.

siginitsetinv(set,mask)
    Initializes the low bits of the sigset_t variable corresponding to signals between 1 and 32 with the complement of the bits contained in mask, and sets the bits corresponding to signals between 33 and 64.

signal_pending(p)
    Returns the value 1 (true) if the process identified by the *p process descriptor has nonblocked pending signals, and returns the value 0 (false) if it doesn't. The function is implemented as a simple check on the sigpending field of the process descriptor.

recalc_sigpending(t)
    Checks whether the process identified by the process descriptor at *t has nonblocked pending signals by looking at the pending and blocked fields of the process, and then sets the sigpending field to 0 or 1 as follows:

        ready = t->pending.signal.sig[1] & ~t->blocked.sig[1];
        ready |= t->pending.signal.sig[0] & ~t->blocked.sig[0];
        t->sigpending = (ready != 0);

flush_signals(t)
    Deletes all signals sent to the process identified by the process descriptor at *t. This is done by clearing both the t->sigpending and the t->pending.signal fields and by emptying the queue of pending signals.
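The bit arithmetic behind these operations can be tried outside the kernel. The sketch below is a user-space model, not kernel code: the type sigset_model_t and the model_ function names are invented here, and uint32_t is used so that each array element is 32 bits wide, as an unsigned long is on the 80x86 platform.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a 64-signal set: two 32-bit words. */
typedef struct {
    uint32_t sig[2];
} sigset_model_t;

/* Signal nsig lives in word (nsig-1)/32, bit (nsig-1)%32. */
static void model_sigaddset(sigset_model_t *set, int nsig)
{
    assert(nsig >= 1 && nsig <= 64);
    set->sig[(nsig - 1) / 32] |= UINT32_C(1) << ((nsig - 1) % 32);
}

static void model_sigdelset(sigset_model_t *set, int nsig)
{
    assert(nsig >= 1 && nsig <= 64);
    set->sig[(nsig - 1) / 32] &= ~(UINT32_C(1) << ((nsig - 1) % 32));
}

static int model_sigismember(const sigset_model_t *set, int nsig)
{
    assert(nsig >= 1 && nsig <= 64);
    return 1 & (set->sig[(nsig - 1) / 32] >> ((nsig - 1) % 32));
}

/* Same test as recalc_sigpending( ): is any pending signal not blocked? */
static int model_has_deliverable(const sigset_model_t *pending,
                                 const sigset_model_t *blocked)
{
    uint32_t ready;

    ready = pending->sig[1] & ~blocked->sig[1];
    ready |= pending->sig[0] & ~blocked->sig[0];
    return ready != 0;
}
```

For instance, signal 9 maps to bit 8 of the first word and signal 40 to bit 7 of the second word.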


10.2 Generating a Signal

When a signal is sent to a process, either from the kernel or from another process, the kernel generates it by invoking the send_sig_info( ), send_sig( ), force_sig( ), or force_sig_info( ) functions. These accomplish the first phase of signal handling described earlier in Section 10.1, updating the process descriptor as needed. They do not directly perform the second phase of delivering the signal but, depending on the type of signal and the state of the process, may wake up the process and force it to receive the signal.

10.2.1 The send_sig_info( ) and send_sig( ) Functions

The send_sig_info( ) function acts on three parameters:

sig
    The signal number.

info
    Either the address of a siginfo_t table or one of two special values. 0 means that the signal has been sent by a User Mode process, while 1 means that it has been sent by the kernel.

t
    A pointer to the descriptor of the destination process.

The send_sig_info( ) function starts by checking whether the parameters are correct:

    if (sig < 0 || sig > 64)
        return -EINVAL;

The function then checks if the signal is being sent by a User Mode process. This occurs when info is equal to 0 or when the si_code field of the siginfo_t table is negative or 0 (positive values of this field mean that the signal was sent by some kernel function):

    if ((!info || ((unsigned long)info != 1 && (info->si_code <= 0)))
        && ((sig != SIGCONT) || (current->session != t->session))
        && (current->euid ^ t->suid) && (current->euid ^ t->uid)
        && (current->uid ^ t->suid) && (current->uid ^ t->uid)
        && !capable(CAP_KILL))
        return -EPERM;

If the signal is sent by a User Mode process, the function determines whether the operation is allowed. The signal is delivered only if at least one of the following conditions holds:

●  The owner of the sending process has the proper capability (usually, this simply means the signal was issued by the system administrator; see Chapter 20).

●  The signal is SIGCONT and the destination process is in the same login session as the sending process.

●  Both processes belong to the same user.

If the sig parameter has the value 0, the function returns immediately without generating any signal. Since 0 is not a valid signal number, it is used to allow the sending process to check whether it has the required privileges to send a signal to the destination process. The function also returns if the destination process is in the TASK_ZOMBIE state, indicated by checking whether the signal_struct descriptor referenced by its sig field has been released:

    if (!sig || !t->sig)
        return 0;

Now the kernel has finished the preliminary checks, and it is going to fiddle with the signal-related data structures. To avoid race conditions, it disables the interrupts and acquires the signal spin lock of the destination process:

    spin_lock_irqsave(&t->sigmask_lock, flags);

Some types of signals might nullify other pending signals for the destination process. Therefore, the function checks whether one of the following cases occurs:

●  sig is a SIGKILL or SIGCONT signal. If the destination process is stopped, it is put in the TASK_RUNNING state so that it is able to either execute the do_exit( ) function or just continue its execution; moreover, if the destination process has SIGSTOP, SIGTSTP, SIGTTOU, or SIGTTIN pending signals, they are removed:

        if (t->state == TASK_STOPPED)
            wake_up_process(t);
        t->exit_code = 0;
        rm_sig_from_queue(SIGSTOP, t);
        rm_sig_from_queue(SIGTSTP, t);
        rm_sig_from_queue(SIGTTOU, t);
        rm_sig_from_queue(SIGTTIN, t);

   The rm_sig_from_queue( ) function clears the bit in t->pending.signal associated with the signal number passed as first argument and removes any item in the pending signal queue of the process that corresponds to that signal number.

●  sig is a SIGSTOP, SIGTSTP, SIGTTIN, or SIGTTOU signal. If the destination process has a pending SIGCONT signal, it is removed:

        rm_sig_from_queue(SIGCONT, t);
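The two cases above amount to a small rule on the set of pending signals. The sketch below models that rule on a plain 64-bit mask; bit_of( ) and cancel_conflicting( ) are hypothetical helpers, and the model deliberately ignores the queue items and the wake-up of a stopped process.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdint.h>

/* Bit that represents signal sig in a 64-bit pending mask. */
static uint64_t bit_of(int sig)
{
    assert(sig >= 1 && sig <= 64);
    return UINT64_C(1) << (sig - 1);
}

/* Returns the pending mask after the arrival of sig has nullified
 * the pending signals that conflict with it. */
static uint64_t cancel_conflicting(uint64_t pending, int sig)
{
    const uint64_t stops = bit_of(SIGSTOP) | bit_of(SIGTSTP) |
                           bit_of(SIGTTIN) | bit_of(SIGTTOU);

    if (sig == SIGKILL || sig == SIGCONT)
        pending &= ~stops;            /* a continue or kill cancels pending stops */
    else if (bit_of(sig) & stops)
        pending &= ~bit_of(SIGCONT);  /* a stop cancels a pending continue */
    return pending;
}
```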

Next, send_sig_info( ) checks whether the new signal can be handled immediately. In this case, the function also takes care of the delivering phase of the signal:

    if (ignored_signal(sig, t)) {
        spin_unlock_irqrestore(&t->sigmask_lock, flags);
        return 0;
    }

The ignored_signal( ) function returns the value 1 when all three conditions for ignoring a signal that are mentioned in Section 10.1 are satisfied. However, to fulfill a POSIX requirement, the SIGCHLD signal is handled specially. POSIX distinguishes between explicitly setting the "ignore" action for the SIGCHLD signal and leaving the default in place (even if the default is to ignore the signal). To let the kernel clean up a terminated child process and prevent it from becoming a zombie (see Section 3.5.2), the parent must explicitly set the action to "ignore" the signal. So ignored_signal( ) handles this case as follows: if SIGCHLD is explicitly ignored, ignored_signal( ) returns 0, but if the default action was "ignore" and the process didn't change that default, ignored_signal( ) returns 1.

If ignored_signal( ) returns 1, the siginfo_t table of the destination process must not be updated, and the send_sig_info( ) function terminates. Since the signal is no longer pending, it has been effectively delivered to the destination process, even if the process never sees it.

If ignored_signal( ) returns 0, the phase of signal delivering has to be deferred; therefore, send_sig_info( ) may have to modify the data structures of the destination process to let it know later that a new signal has been sent to it. However, if the signal being handled was already pending, the send_sig_info( ) function can simply terminate. In fact, there can be at most one occurrence of any regular signal type in the pending signal queue of a process because regular signal occurrences are not really queued:

    if (sig < 32 && sigismember(&t->pending.signal, sig)) {
        spin_unlock_irqrestore(&t->sigmask_lock, flags);
        return 0;
    }

If it proceeds, the send_sig_info( ) function must insert a new item in the pending signal queue of the destination process. This is achieved by invoking the send_signal( ) function:

    retval = send_signal(sig, info, &t->pending);

In turn, the send_signal( ) function checks the length of the pending signal queue and appends a new sigqueue data structure:

    if (atomic_read(&nr_queued_signals) < max_queued_signals) {
        q = kmem_cache_alloc(sigqueue_cachep, GFP_ATOMIC);
        atomic_inc(&nr_queued_signals);
        q->next = NULL;
        *(t->pending.tail) = q;
        t->pending.tail = &q->next;

Then the send_signal( ) function fills the siginfo_t table inside the new queue item:

        if ((unsigned long)info == 0) {
            q->info.si_signo = sig;
            q->info.si_errno = 0;
            q->info.si_code = SI_USER;
            q->info._sifields._kill._pid = current->pid;
            q->info._sifields._kill._uid = current->uid;
        } else if ((unsigned long)info == 1) {
            q->info.si_signo = sig;
            q->info.si_errno = 0;
            q->info.si_code = SI_KERNEL;
            q->info._sifields._kill._pid = 0;
            q->info._sifields._kill._uid = 0;
        } else
            copy_siginfo(&q->info, info);
    }

The info argument passed to the send_signal( ) function either points to a previously built siginfo_t table or stores the constants 0 (for a signal sent by a User Mode process) or 1 (for a signal sent by a kernel function).

If it is not possible to add an item to the queue, either because it already includes max_queued_signals elements or because there is no free memory for the sigqueue data structure, the signal occurrence cannot be queued. If the signal is real-time and was sent through a system call that is explicitly required to queue it (like rt_sigqueueinfo( )), the send_signal( ) function returns an error code. Otherwise, it sets the corresponding bit in t->pending.signal:

    if (sig >= 32 && info && (unsigned long)info != 1 &&
        info->si_code != SI_USER)
        return -EAGAIN;
    sigaddset(&t->pending.signal, sig);
    return 0;

It is important to let the destination process receive the signal even if there is no room for the corresponding item in the pending signal queue. Suppose, for instance, that a process is consuming too much memory. The kernel must ensure that the kill( ) system call succeeds even if there is no free memory; otherwise, the system administrator doesn't have any chance to recover the system by terminating the offending process.

If the send_signal( ) function successfully terminated and the signal is not blocked, the destination process has a new pending signal to consider:

    if (!retval && !sigismember(&t->blocked, sig))
        signal_wake_up(t);

The signal_wake_up( ) function performs three actions:

1. Sets the sigpending flag of the destination process.

2. Checks whether the destination process is already running on another CPU and, in this case, sends an interprocessor interrupt to that CPU to force a reschedule of the current process (see Section 4.6.2). Since each process checks the existence of pending signals when returning from the schedule( ) function, the interprocessor interrupt ensures that the destination process quickly notices the new pending signal if it is already running.

3. Checks whether the destination process is in the TASK_INTERRUPTIBLE state and, in this case, wakes it up by invoking the wake_up_process( ) function.

Finally, the send_sig_info( ) function re-enables the interrupts, releases the spin lock, and terminates with the error code of send_signal( ):

    spin_unlock_irqrestore(&t->sigmask_lock, flags);
    return retval;

The send_sig( ) function is similar to send_sig_info( ). However, the info parameter is replaced by a priv flag, which is 1 if the signal is sent by the kernel and 0 if it is sent by a process. The send_sig( ) function is implemented as a special case of send_sig_info( ):

    return send_sig_info(sig, (void*)(long)(priv != 0), t);

10.2.2 The force_sig_info( ) and force_sig( ) Functions

The force_sig_info(sig, info, t) function is used by the kernel to send signals that cannot be explicitly ignored or blocked by the destination processes. The function's parameters are the same as those of send_sig_info( ). The force_sig_info( ) function acts on the signal_struct data structure that is referenced by the sig field included in the descriptor t of the destination process:

    spin_lock_irqsave(&t->sigmask_lock, flags);
    if (t->sig->action[sig-1].sa.sa_handler == SIG_IGN)
        t->sig->action[sig-1].sa.sa_handler = SIG_DFL;
    sigdelset(&t->blocked, sig);
    recalc_sigpending(t);
    spin_unlock_irqrestore(&t->sigmask_lock, flags);
    return send_sig_info(sig, info, t);

force_sig( ) is similar to force_sig_info( ). Its use is limited to signals sent by the kernel; it can be implemented as a special case of the force_sig_info( ) function:

    force_sig_info(sig, (void*)1L, t);
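The preparation step that makes a forced signal impossible to ignore or block can be modeled in a few lines. This is a sketch under stated assumptions: enum disposition_model, struct task_sig_model, and force_sig_prepare( ) are invented names, locking is omitted, and the final call to send_sig_info( ) is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical three-way summary of the sa_handler field. */
enum disposition_model { DISP_DEFAULT, DISP_IGNORE, DISP_HANDLER };

/* Hypothetical subset of the per-process signal state. */
struct task_sig_model {
    enum disposition_model action[64];
    uint64_t blocked;  /* bit (n-1) set: signal n is blocked */
};

/* Mirrors the effect of force_sig_info( ) before it sends the signal. */
static void force_sig_prepare(struct task_sig_model *t, int sig)
{
    assert(sig >= 1 && sig <= 64);
    if (t->action[sig - 1] == DISP_IGNORE)
        t->action[sig - 1] = DISP_DEFAULT;      /* "ignore" becomes "default" */
    t->blocked &= ~(UINT64_C(1) << (sig - 1));  /* the signal is unblocked */
}
```

Note that an installed handler is left untouched; only the "ignore" setting and the blocked bit are overridden.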


10.3 Delivering a Signal

We assume that the kernel noticed the arrival of a signal and invoked one of the functions mentioned in the previous section to prepare the process descriptor of the process that is supposed to receive the signal. But in case that process was not running on the CPU at that moment, the kernel deferred the task of delivering the signal. We now turn to the activities that the kernel performs to ensure that pending signals of a process are handled.

As mentioned in Section 4.8, the kernel checks the value of the sigpending flag of the process descriptor before allowing the process to resume its execution in User Mode. Thus, the kernel checks for the existence of pending signals every time it finishes handling an interrupt or an exception.

To handle the nonblocked pending signals, the kernel invokes the do_signal( ) function, which receives two parameters:

regs
    The address of the stack area where the User Mode register contents of the current process are saved.

oldset
    The address of a variable where the function is supposed to save the bit mask array of blocked signals. It is NULL if there is no need to save the bit mask array.

The do_signal( ) function starts by checking whether the function itself was triggered by an interrupt; if so, it simply returns. Otherwise, if the function was triggered by an exception that was raised while the process was running in User Mode, the function continues executing. The test inspects the two lowest bits of the saved cs register, which hold the privilege level of the interrupted code (3 in User Mode):

    if ((regs->xcs & 3) != 3)
        return 1;

However, as we'll see in Section 10.3.4, this does not mean that a system call cannot be interrupted by a signal.

If the oldset parameter is NULL, the function initializes it with the address of the current->blocked field:

    if (!oldset)
        oldset = &current->blocked;

The heart of the do_signal( ) function consists of a loop that repeatedly invokes the dequeue_signal( ) function until no nonblocked pending signals are left. The return code of dequeue_signal( ) is stored in the signr local variable. If its value is 0, it means that all pending signals have been handled and do_signal( ) can finish. As long as a nonzero value is returned, a pending signal is waiting to be handled. dequeue_signal( ) is invoked again after do_signal( ) handles the current signal.

The dequeue_signal( ) function always considers the lowest-numbered nonblocked pending signal. It updates the data structures to indicate that the signal is no longer pending and returns its number. This task involves clearing the corresponding bit in current->pending.signal and updating the value of current->sigpending. In the mask parameter, each bit that is set represents a blocked signal:

    sig = 0;
    if ((x = current->pending.signal.sig[0] & ~mask->sig[0]) != 0)
        sig = 1 + ffz(~x);
    else if ((x = current->pending.signal.sig[1] & ~mask->sig[1]) != 0)
        sig = 33 + ffz(~x);
    if (sig) {
        sigdelset(&current->pending.signal, sig);
        recalc_sigpending(current);
    }
    return sig;

The collection of currently pending signals is ANDed with the complement of mask, that is, with the set of nonblocked signals. If anything is left, it represents a signal that should be delivered to the process. The ffz( ) function returns the index of the first bit equal to 0 in its parameter; since it is applied to the complement of x, it yields the index of the lowest bit set in x. This value is used to compute the lowest-numbered signal to be delivered.

Let's see how the do_signal( ) function handles any pending signal whose number is returned by dequeue_signal( ). First, it checks whether the current receiver process is being monitored by some other process; in this case, do_signal( ) invokes notify_parent( ) and schedule( ) to make the monitoring process aware of the signal handling.

Then do_signal( ) loads the ka local variable with the address of the k_sigaction data structure of the signal to be handled:

    ka = &current->sig->action[signr-1];

Depending on the contents, three kinds of actions may be performed: ignoring the signal, executing a default action, or executing a signal handler.
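The selection rule applied by dequeue_signal( ), described earlier in this section, can be sketched in user space as follows. The names dequeue_model( ) and lowest_bit( ) are hypothetical, the two-word arrays stand in for sigset_t, and a simple loop replaces the kernel's ffz( ) primitive.

```c
#include <assert.h>
#include <stdint.h>

/* Index of the lowest bit set in x; x must be nonzero. */
static int lowest_bit(uint32_t x)
{
    int i = 0;

    assert(x != 0);
    while (!(x & 1u)) {
        x >>= 1;
        i++;
    }
    return i;
}

/* Returns the lowest-numbered pending signal that is not blocked and
 * clears its pending bit; returns 0 if no such signal exists. */
static int dequeue_model(uint32_t pending[2], const uint32_t blocked[2])
{
    uint32_t x;
    int sig = 0;

    if ((x = pending[0] & ~blocked[0]) != 0)
        sig = 1 + lowest_bit(x);        /* signals 1..32 */
    else if ((x = pending[1] & ~blocked[1]) != 0)
        sig = 33 + lowest_bit(x);       /* signals 33..64 */
    if (sig)
        pending[(sig - 1) / 32] &= ~(UINT32_C(1) << ((sig - 1) % 32));
    return sig;
}
```

With signals 2, 15, and 34 pending and signal 2 blocked, successive calls return 15, then 34, then 0, while the bit of signal 2 stays set.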

10.3.1 Ignoring the Signal

When a delivered signal is explicitly ignored, the do_signal( ) function normally just continues with a new execution of the loop and therefore considers another pending signal. One exception exists, as described earlier:

    if (ka->sa.sa_handler == SIG_IGN) {
        if (signr == SIGCHLD)
            while (sys_wait4(-1, NULL, WNOHANG, NULL) > 0)
                /* nothing */;
        continue;
    }

If the signal delivered is SIGCHLD, the sys_wait4( ) service routine of the wait4( ) system call is invoked to force the process to read information about its children, thus cleaning up memory left over by the terminated child processes (see Section 3.5).
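The effect of explicitly ignoring SIGCHLD can be observed from User Mode. The sketch below assumes a POSIX system on which a parent that ignores SIGCHLD never gets zombie children, so that waiting for a terminated child fails with ECHILD; the function name child_is_auto_reaped( ) is made up for this example, and the exact kernel path that performs the cleanup may differ from the one described in this chapter.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if the terminated child left no zombie to wait for,
 * 0 if it could still be waited for, and -1 on a setup failure. */
static int child_is_auto_reaped(void)
{
    struct sigaction ign, old;
    pid_t pid;
    int result;

    ign.sa_handler = SIG_IGN;          /* explicit "ignore", not the default */
    sigemptyset(&ign.sa_mask);
    ign.sa_flags = 0;
    if (sigaction(SIGCHLD, &ign, &old) != 0)
        return -1;

    pid = fork();
    if (pid < 0) {
        sigaction(SIGCHLD, &old, NULL);
        return -1;
    }
    if (pid == 0)
        _exit(0);                      /* child terminates at once */

    /* Blocks until the child is gone, then reports that nothing is left. */
    result = (waitpid(pid, NULL, 0) == -1 && errno == ECHILD) ? 1 : 0;
    sigaction(SIGCHLD, &old, NULL);    /* restore the previous setting */
    return result;
}
```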

10.3.2 Executing the Default Action for the Signal

If ka->sa.sa_handler is equal to SIG_DFL, do_signal( ) must perform the default action of the signal. The only exception comes when the receiving process is init, in which case the signal is discarded as described in the earlier section Section 10.1.1:

    if (current->pid == 1)
        continue;

For other processes, since the default action depends on the type of signal, the function executes a switch statement based on the value of signr.

The signals whose default action is "ignore" are easily handled:

    case SIGCONT: case SIGCHLD: case SIGWINCH:
        continue;

The signals whose default action is "stop" may stop the current process. To do this, do_signal( ) sets the state of current to TASK_STOPPED and then invokes the schedule( ) function (see Section 11.2.2). The do_signal( ) function also sends a SIGCHLD signal to the parent process of current, unless the parent has set the SA_NOCLDSTOP flag of SIGCHLD:

    case SIGTSTP: case SIGTTIN: case SIGTTOU:
        if (is_orphaned_pgrp(current->pgrp))
            continue;
        /* fall through */
    case SIGSTOP:
        current->state = TASK_STOPPED;
        current->exit_code = signr;
        if (current->p_pptr->sig && !(SA_NOCLDSTOP &
            current->p_pptr->sig->action[SIGCHLD-1].sa.sa_flags))
            notify_parent(current, SIGCHLD);
        schedule();
        continue;

The difference between SIGSTOP and the other signals is subtle: SIGSTOP always stops the process, while the other signals stop the process only if it is not in an "orphaned process group." The POSIX standard specifies that a process group is not orphaned as long as there is a process in the group that has a parent in a different process group but in the same session.

The signals whose default action is "dump" may create a core file in the process working directory; this file lists the complete contents of the process's address space and CPU registers. After do_signal( ) creates the core file, it kills the process. The default action of the remaining 18 signals is "terminate," which consists of just killing the process:

    exit_code = signr;
    case SIGQUIT: case SIGILL: case SIGTRAP:
    case SIGABRT: case SIGFPE: case SIGSEGV:
    case SIGBUS: case SIGSYS: case SIGXCPU: case SIGXFSZ:
        if (do_coredump(signr, regs))
            exit_code |= 0x80;
    default:
        sigaddset(&current->pending.signal, signr);
        recalc_sigpending(current);
        current->flags |= PF_SIGNALED;
        do_exit(exit_code);

The do_exit( ) function receives as its input parameter the signal number ORed with a flag set when a core dump has been performed. That value is used to set the exit code of the process. The function terminates the current process, and hence never returns (see Chapter 20).
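A parent process can read back the exit code set by the "terminate" default action through the status returned by waitpid( ). The sketch below assumes a POSIX system; status_of_child_killed_by( ) is a hypothetical helper that forks a child, lets the default action of the given signal kill it, and returns the raw wait status. It is meant for signals such as SIGTERM, whose default action terminates the process without creating a core file.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns the wait status of a child killed by sig, or -1 on failure. */
static int status_of_child_killed_by(int sig)
{
    int status = -1;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        sigset_t only;

        /* Make sure the default action really applies in the child. */
        sigemptyset(&only);
        sigaddset(&only, sig);
        sigprocmask(SIG_UNBLOCK, &only, NULL);
        signal(sig, SIG_DFL);
        kill(getpid(), sig);
        _exit(127);  /* reached only if the signal did not terminate us */
    }
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return status;
}
```

The WIFSIGNALED( ) and WTERMSIG( ) macros applied to the returned status tell whether the child was killed by a signal and which one it was.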

10.3.3 Catching the Signal

If a handler has been established for the signal, the do_signal( ) function must enforce its execution. It does this by invoking handle_signal( ):

    handle_signal(signr, ka, &info, oldset, regs);
    return 1;

Notice how do_signal( ) returns after having handled a single signal. Other pending signals won't be considered until the next invocation of do_signal( ). This approach ensures that real-time signals will be dealt with in the proper order.

Executing a signal handler is a rather complex task because of the need to juggle stacks carefully while switching between User Mode and Kernel Mode. We explain exactly what is entailed here.

Signal handlers are functions defined by User Mode processes and included in the User Mode code segment. The handle_signal( ) function runs in Kernel Mode while signal handlers run in User Mode; this means that the current process must first execute the signal handler in User Mode before being allowed to resume its "normal" execution. Moreover, when the kernel attempts to resume the normal execution of the process, the Kernel Mode stack no longer contains the hardware context of the interrupted program because the Kernel Mode stack is emptied at every transition from User Mode to Kernel Mode.

An additional complication is that signal handlers may invoke system calls. In this case, after the service routine executes, control must be returned to the signal handler instead of to the code of the interrupted program.

The solution adopted in Linux consists of copying the hardware context saved in the Kernel Mode stack onto the User Mode stack of the current process. The User Mode stack is also modified in such a way that, when the signal handler terminates, the sigreturn( ) system call is automatically invoked to copy the hardware context back on the Kernel Mode stack and restore the original content of the User Mode stack.

Figure 10-2 illustrates the flow of execution of the functions involved in catching a signal. A nonblocked signal is sent to a process. When an interrupt or exception occurs, the process switches into Kernel Mode. Right before returning to User Mode, the kernel executes the do_signal( ) function, which in turn handles the signal (by invoking handle_signal( )) and sets up the User Mode stack (by invoking setup_frame( ) or setup_rt_frame( )). When the process switches again to User Mode, it starts executing the signal handler because the handler's starting address was forced into the program counter. When that function terminates, the return code placed on the User Mode stack by the setup_frame( ) or setup_rt_frame( ) function is executed. This code invokes the sigreturn( ) system call, whose service routine copies the hardware context of the normal program in the Kernel Mode stack and restores the User Mode stack back to its original state (by invoking restore_sigcontext( )). When the system call terminates, the normal program can thus resume its execution.

Figure 10-2. Catching a signal

Le t 's n o w e xa m in e in d e t a il h o w t h is s ch e m e is ca rrie d o u t .

10.3.3.1 Setting up the frame

To properly set the User Mode stack of the process, the handle_signal( ) function invokes either setup_frame( ) (for signals that do not require a siginfo_t table; see Section 10.4 later in this chapter) or setup_rt_frame( ) (for signals that do require a siginfo_t table). To choose between these two functions, the kernel checks the value of the SA_SIGINFO flag in the sa_flags field of the sigaction table associated with the signal.

The setup_frame( ) function receives four parameters, which have the following meanings:

sig
    Signal number

ka
    Address of the k_sigaction table associated with the signal

oldset
    Address of a bit mask array of blocked signals

regs
    Address in the Kernel Mode stack area where the User Mode register contents are saved

The setup_frame( ) function pushes onto the User Mode stack a data structure called a frame, which contains the information needed to handle the signal and to ensure the correct return to the sys_sigreturn( ) function. A frame is a sigframe table that includes the following fields (see Figure 10-3):

pretcode
    Return address of the signal handler function; it points to the retcode field (later in this list) in the same table.

sig
    The signal number; this is the parameter required by the signal handler.

sc
    Structure of type sigcontext containing the hardware context of the User Mode process right before switching to Kernel Mode (this information is copied from the Kernel Mode stack of current). It also contains a bit array that specifies the blocked regular signals of the process.

fpstate
    Structure of type _fpstate that may be used to store the floating point registers of the User Mode process (see Section 3.3.4).

extramask
    Bit array that specifies the blocked real-time signals.

retcode
    Eight-byte code issuing a sigreturn( ) system call; this code is executed when returning from the signal handler.

Figure 10-3. Frame on the User Mode stack

The setup_frame( ) function starts by invoking get_sigframe( ) to compute the first memory location of the frame. That memory location is usually [4] in the User Mode stack, so the function returns the value:

    (regs->esp - sizeof(struct sigframe)) & 0xfffffff8

Since stacks grow toward lower addresses, the initial address of the frame is obtained by subtracting its size from the address of the current stack top and aligning the result to a multiple of 8.

[4] Linux allows processes to specify an alternate stack for their signal handlers by invoking the sigaltstack( ) system call; this feature is also requested by the X/Open standard. When an alternate stack is present, the get_sigframe( ) function returns an address inside that stack. We don't discuss this feature further, since it is conceptually similar to regular signal handling.

The returned address is then verified by means of the access_ok macro; if it is valid, the function repeatedly invokes __put_user( ) to fill all the fields of the frame. Once this is done, it modifies the regs area of the Kernel Mode stack, thus ensuring that control is transferred to the signal handler when current resumes its execution in User Mode:

    regs->esp = (unsigned long) frame;
    regs->eip = (unsigned long) ka->sa.sa_handler;

The setup_frame( ) function terminates by resetting the segmentation registers saved on the Kernel Mode stack to their default value. Now the information needed by the signal handler is on the top of the User Mode stack.

The setup_rt_frame( ) function is very similar to setup_frame( ), but it puts on the User Mode stack an extended frame (stored in the rt_sigframe data structure) that also includes the content of the siginfo_t table associated with the signal.
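The address computation performed by get_sigframe( ) can be tried out in isolation. The following is a minimal User Mode sketch, not kernel code: the frame_start( ) name and its parameters are hypothetical, and the frame size is passed in explicitly instead of being taken from sizeof(struct sigframe).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of the kernel): compute where a frame of
 * frame_size bytes would start below the User Mode stack top user_esp. */
uint32_t frame_start(uint32_t user_esp, size_t frame_size)
{
    /* The stack grows toward lower addresses, so subtract the frame size,
     * then clear the three low-order bits to round down to a multiple of 8. */
    return (user_esp - (uint32_t)frame_size) & 0xfffffff8u;
}
```

For instance, with a stack top of 0x1000 and a 100-byte frame, the subtraction gives 3996, which is rounded down to 3992.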

10.3.3.2 Evaluating the signal flags

After setting up the User Mode stack, the handle_signal( ) function checks the values of the flags associated with the signal. If the received signal has the SA_ONESHOT flag set, it must be reset to its default action so that further occurrences of the same signal will not trigger the execution of the signal handler:

    if (ka->sa.sa_flags & SA_ONESHOT)
        ka->sa.sa_handler = SIG_DFL;

Moreover, if the signal does not have the SA_NODEFER flag set, the signals in the sa_mask field of the sigaction table must be blocked during the execution of the signal handler:

    if (!(ka->sa.sa_flags & SA_NODEFER)) {
        spin_lock_irq(&current->sigmask_lock);
        sigorsets(&current->blocked, &current->blocked, &ka->sa.sa_mask);
        sigaddset(&current->blocked, sig);
        recalc_sigpending(current);
        spin_unlock_irq(&current->sigmask_lock);
    }

As described earlier, the recalc_sigpending( ) function checks whether the process has nonblocked pending signals and sets its sigpending field accordingly.

The function then returns to do_signal( ), which also returns immediately.
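The effect of the SA_NODEFER test on the set of blocked signals can be modeled with plain bit masks. The sketch below is only an illustration under simplifying assumptions: a signal set is a single 64-bit word, the DEMO_SA_NODEFER value is made up for the example, and all names are hypothetical; no locking is modeled.

```c
#include <stdint.h>

#define DEMO_SA_NODEFER 0x1u   /* illustrative flag value, not the kernel's */

/* Bit (sig - 1) stands for signal number sig, since there is no signal 0. */
uint64_t demo_sigmask(int sig)
{
    return (uint64_t)1 << (sig - 1);
}

/* Return the blocked set that would be in force while the handler of
 * signal sig runs, mirroring the logic of the kernel fragment above. */
uint64_t blocked_during_handler(uint64_t blocked, uint64_t sa_mask,
                                unsigned int sa_flags, int sig)
{
    if (!(sa_flags & DEMO_SA_NODEFER)) {
        blocked |= sa_mask;            /* like sigorsets( ) */
        blocked |= demo_sigmask(sig);  /* like sigaddset( ) */
    }
    return blocked;
}
```

In this model, setting the flag leaves the blocked set untouched, which matches the kernel fragment: neither sa_mask nor the delivered signal is added.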

10.3.3.3 Starting the signal handler

When do_signal( ) returns, the current process resumes its execution in User Mode. Because of the preparation by setup_frame( ) described earlier, the eip register points to the first instruction of the signal handler, while esp points to the first memory location of the frame that has been pushed on top of the User Mode stack. As a result, the signal handler is executed.

10.3.3.4 Terminating the signal handler

When the signal handler terminates, the return address on top of the stack points to the code in the retcode field of the frame. For signals without a siginfo_t table, the code is equivalent to the following assembly language instructions:

    popl %eax
    movl $__NR_sigreturn, %eax
    int $0x80

Therefore, the signal number (that is, the sig field of the frame) is discarded from the stack, and the sigreturn( ) system call is then invoked.

The sys_sigreturn( ) function computes the address of the pt_regs data structure regs, which contains the hardware context of the User Mode process (see Section 9.2.3). From the value stored in the esp field, it can thus derive and check the frame address inside the User Mode stack:

    frame = (struct sigframe *)(regs.esp - 8);
    if (verify_area(VERIFY_READ, frame, sizeof(*frame))) {
        force_sig(SIGSEGV, current);
        return 0;
    }

Then the function copies the bit array of signals that were blocked before invoking the signal handler from the sc field of the frame to the blocked field of current. As a result, all signals that have been masked for the execution of the signal handler are unblocked. The recalc_sigpending( ) function is then invoked.

The sys_sigreturn( ) function must at this point copy the process hardware context from the sc field of the frame to the Kernel Mode stack and remove the frame from the User Mode stack; it performs these two tasks by invoking the restore_sigcontext( ) function.

If the signal was sent by a system call like rt_sigqueueinfo( ) that required a siginfo_t table to be associated with the signal, the mechanism is very similar. The return code in the retcode field of the extended frame invokes the rt_sigreturn( ) system call; the corresponding sys_rt_sigreturn( ) service routine copies the process hardware context from the extended frame to the Kernel Mode stack and restores the original User Mode stack content by removing the extended frame from it.
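The constant 8 that sys_sigreturn( ) subtracts from the saved esp follows from the two 4-byte fields at the start of the frame: the handler's final ret instruction pops pretcode, and the popl %eax instruction pops sig. The following sketch models this arithmetic for a 32-bit stack pointer; the function names are hypothetical and the code is an illustration, not the kernel's implementation.

```c
#include <stdint.h>

/* Hypothetical model of the User Mode stack pointer on 80x86 (32 bit). */
uint32_t esp_when_sigreturn_starts(uint32_t frame)
{
    uint32_t esp = frame;  /* the handler is entered with esp at the frame */
    esp += 4;              /* the handler's ret pops the pretcode field    */
    esp += 4;              /* popl %eax pops the sig field                 */
    return esp;            /* value saved in regs when int $0x80 executes  */
}

/* Inverse step, as in the frame computation of sys_sigreturn( ). */
uint32_t frame_from_esp(uint32_t esp)
{
    return esp - 8;
}
```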

10.3.4 Reexecution of System Calls

The request associated with a system call cannot always be immediately satisfied by the kernel; when this happens, the process that issued the system call is put in a TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE state.

If the process is put in a TASK_INTERRUPTIBLE state and some other process sends a signal to it, the kernel puts it in the TASK_RUNNING state without completing the system call (see Section 4.8). When this happens, the system call service routine does not complete its job, but returns an EINTR, ERESTARTNOHAND, ERESTARTSYS, or ERESTARTNOINTR error code. The signal is delivered to the process while switching back to User Mode.

In practice, the only error code a User Mode process can get in this situation is EINTR, which means that the system call has not been completed. (The application programmer may check this code and decide whether to reissue the system call.) The remaining error codes are used internally by the kernel to specify whether the system call may be reexecuted automatically after the signal handler termination.

Table 10-6 lists the error codes related to unfinished system calls and their impact for each of the three possible signal actions. The terms that appear in the entries are defined in the following list:

Terminate
    The system call will not be automatically reexecuted; the process will resume its execution in User Mode at the instruction following the int $0x80 one and the eax register will contain the -EINTR value.

Reexecute
    The kernel forces the User Mode process to reload the eax register with the system call number and to reexecute the int $0x80 instruction; the process is not aware of the reexecution and the error code is not passed to it.

Depends
    The system call is reexecuted only if the SA_RESTART flag of the delivered signal is set; otherwise, the system call terminates with a -EINTR error code.

Table 10-6. Reexecution of system calls

                 Error codes and their impact on system call execution
Signal action    EINTR        ERESTARTSYS    ERESTARTNOHAND    ERESTARTNOINTR
Default          Terminate    Reexecute      Reexecute         Reexecute
Ignore           Terminate    Reexecute      Reexecute         Reexecute
Catch            Terminate    Depends        Terminate         Reexecute

When delivering a signal, the kernel must be sure that the process really issued a system call before attempting to reexecute it. This is where the orig_eax field of the regs hardware context plays a critical role. Let's recall how this field is initialized when the interrupt or exception handler starts:

Interrupt
    The field contains the IRQ number associated with the interrupt minus 256 (see Section 4.6.1.4).

0x80 exception
    The field contains the system call number (see Section 9.2.2).

Other exceptions
    The field contains the value -1 (see Section 4.5.1).

Therefore, a non-negative value in the orig_eax field means that the signal has woken up a TASK_INTERRUPTIBLE process that was sleeping in a system call. The service routine recognizes that the system call was interrupted, and thus returns one of the previously mentioned error codes.

If the signal is explicitly ignored or if its default action is enforced, do_signal( ) analyzes the error code of the system call to decide whether the unfinished system call must be automatically reexecuted, as specified in Table 10-6. If the call must be restarted, the function modifies the regs hardware context so that, when the process is back in User Mode, eip points to the int $0x80 instruction and eax contains the system call number:

    if (regs->orig_eax >= 0) {
        if (regs->eax == -ERESTARTNOHAND ||
            regs->eax == -ERESTARTSYS ||
            regs->eax == -ERESTARTNOINTR) {
            regs->eax = regs->orig_eax;
            regs->eip -= 2;
        }
    }

The regs->eax field is filled with the return code of a system call service routine (see Section 9.2.2).

If the signal is caught, handle_signal( ) analyzes the error code and, possibly, the SA_RESTART flag of the sigaction table to decide whether the unfinished system call must be reexecuted:

    if (regs->orig_eax >= 0) {
        switch (regs->eax) {
            case -ERESTARTNOHAND:
                regs->eax = -EINTR;
                break;
            case -ERESTARTSYS:
                if (!(ka->sa.sa_flags & SA_RESTART)) {
                    regs->eax = -EINTR;
                    break;
                }
            /* fallthrough */
            case -ERESTARTNOINTR:
                regs->eax = regs->orig_eax;
                regs->eip -= 2;
        }
    }

If the system call must be restarted, handle_signal( ) proceeds exactly as do_signal( ); otherwise, it returns an -EINTR error code to the User Mode process.
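The "Catch" row of Table 10-6 can be exercised outside the kernel with a small model of the saved registers. This is a sketch under stated assumptions: struct demo_regs, the DEMO_ constants, and demo_restart_after_catch( ) are hypothetical names defined only for the example, and the SA_RESTART flag is reduced to a Boolean parameter. The eip field is decreased by 2 because the int $0x80 instruction is two bytes long.

```c
#include <stdbool.h>

/* Illustrative stand-ins for the kernel error codes; they are defined here
 * only so that the sketch is self-contained. */
enum {
    DEMO_EINTR          = 4,
    DEMO_ERESTARTSYS    = 512,
    DEMO_ERESTARTNOINTR = 513,
    DEMO_ERESTARTNOHAND = 514
};

/* Hypothetical, reduced model of the saved User Mode registers. */
struct demo_regs {
    long eax;           /* return code of the service routine           */
    long orig_eax;      /* system call number, or negative if none      */
    unsigned long eip;  /* address of the instruction after int $0x80   */
};

/* Apply the decision taken when the signal is caught. Returns true if the
 * system call has been set up for reexecution. */
bool demo_restart_after_catch(struct demo_regs *regs, bool sa_restart)
{
    if (regs->orig_eax < 0)
        return false;                 /* not inside a system call */

    switch (regs->eax) {
    case -DEMO_ERESTARTNOHAND:
        regs->eax = -DEMO_EINTR;      /* Terminate */
        return false;
    case -DEMO_ERESTARTSYS:
        if (!sa_restart) {
            regs->eax = -DEMO_EINTR;  /* Depends: flag clear, Terminate */
            return false;
        }
        /* fall through */
    case -DEMO_ERESTARTNOINTR:
        regs->eax = regs->orig_eax;   /* reload the system call number */
        regs->eip -= 2;               /* back up over int $0x80 */
        return true;
    default:
        return false;                 /* any other return code is left alone */
    }
}
```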


10.4 System Calls Related to Signal Handling

As stated in the introduction of this chapter, programs running in User Mode are allowed to send and receive signals. This means that a set of system calls must be defined to allow these kinds of operations. Unfortunately, for historical reasons, several system calls exist that serve essentially the same purpose. As a result, some of these system calls are never invoked. For instance, sys_sigaction( ) and sys_rt_sigaction( ) are almost identical, so the sigaction( ) wrapper function included in the C library ends up invoking sys_rt_sigaction( ) instead of sys_sigaction( ). We shall describe some of the most significant POSIX system calls.

10.4.1 The kill( ) System Call

The kill(pid,sig) system call is commonly used to send signals; its corresponding service routine is the sys_kill( ) function. The integer pid parameter has several meanings, depending on its numerical value:

pid > 0
    The sig signal is sent to the process whose PID is equal to pid.

pid = 0
    The sig signal is sent to all processes in the same group as the calling process.

pid = -1
    The signal is sent to all processes, except swapper (PID 0), init (PID 1), and current.

pid < -1
    The signal is sent to all processes in the process group -pid.

The sys_kill( ) function sets up a minimal siginfo_t table for the signal, and then invokes kill_something_info( ):

    info.si_signo = sig;
    info.si_errno = 0;
    info.si_code = SI_USER;
    info._sifields._kill._pid = current->pid;
    info._sifields._kill._uid = current->uid;
    return kill_something_info(sig, &info, pid);

The kill_something_info( ) function, in turn, invokes either send_sig_info( ) (to send the signal to a single process), or kill_pg_info( ) (to scan all processes and invoke send_sig_info( ) for each process in the destination group).

The kill( ) system call is able to send any signal, even the so-called real-time signals that have numbers ranging from 32 to 63. However, as we saw in Section 10.2 earlier, the kill( ) system call does not ensure that a new element is added to the pending signal queue of the destination process, so multiple instances of pending signals can be lost. Real-time signals should be sent by means of a system call like rt_sigqueueinfo( ) (see Section 10.4.6 later).

System V and BSD Unix variants also have a killpg( ) system call, which is able to explicitly send a signal to a group of processes. In Linux, the function is implemented as a library function that uses the kill( ) system call. Another variation is raise( ), which sends a signal to the current process (that is, to the process executing the function). In Linux, raise( ) is implemented as a library function.
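The four interpretations of the pid parameter amount to a simple dispatch on its value. The following sketch expresses that dispatch as a pure function; the enum kill_target type and the classify_kill_pid( ) name are hypothetical and exist only for this illustration, and no signal is actually sent.

```c
#include <sys/types.h>

/* Hypothetical labels for the four cases of the pid parameter. */
enum kill_target {
    KILL_ONE_PROCESS,    /* pid > 0:  the process whose PID is pid        */
    KILL_OWN_GROUP,      /* pid == 0: the caller's process group          */
    KILL_ALL_PROCESSES,  /* pid == -1: all but swapper, init, and current */
    KILL_GIVEN_GROUP     /* pid < -1: the process group -pid              */
};

enum kill_target classify_kill_pid(pid_t pid)
{
    if (pid > 0)
        return KILL_ONE_PROCESS;
    if (pid == 0)
        return KILL_OWN_GROUP;
    if (pid == -1)
        return KILL_ALL_PROCESSES;
    return KILL_GIVEN_GROUP;
}
```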

10.4.2 Changing a Signal Action

The sigaction(sig,act,oact) system call allows users to specify an action for a signal; of course, if no signal action is defined, the kernel executes the default action associated with the delivered signal.

The corresponding sys_sigaction( ) service routine acts on two parameters: the sig signal number and the act table of type sigaction that specifies the new action. A third, optional oact output parameter may be used to get the previous action associated with the signal.

The function checks first whether the act address is valid. Then it fills the sa_handler, sa_flags, and sa_mask fields of a new_ka local variable of type k_sigaction with the corresponding fields of *act:

    __get_user(new_ka.sa.sa_handler, &act->sa_handler);
    __get_user(new_ka.sa.sa_flags, &act->sa_flags);
    __get_user(mask, &act->sa_mask);
    siginitset(&new_ka.sa.sa_mask, mask);

The function invokes do_sigaction( ) to copy the new new_ka table into the entry at the sig-1 position of current->sig->action (the number of the signal is one higher than the position in the array because there is no zero signal):

    k = &current->sig->action[sig-1];
    spin_lock(&current->sig->siglock);
    if (act) {
        *k = *act;
        sigdelsetmask(&k->sa.sa_mask, sigmask(SIGKILL) | sigmask(SIGSTOP));
        if (k->sa.sa_handler == SIG_IGN ||
            (k->sa.sa_handler == SIG_DFL &&
             (sig == SIGCONT || sig == SIGCHLD || sig == SIGWINCH))) {
            spin_lock_irq(&current->sigmask_lock);
            if (rm_sig_from_queue(sig, current))
                recalc_sigpending(current);
            spin_unlock_irq(&current->sigmask_lock);
        }
    }

The POSIX standard requires that setting a signal action to SIG_IGN, or to SIG_DFL when the default action is "ignore," causes any pending signal of the same type to be discarded. Moreover, notice that no matter what the requested masked signals are for the signal handler, SIGKILL and SIGSTOP are never masked.

If the oact parameter is not NULL, the contents of the previous sigaction table are copied to the process address space at the address specified by that parameter:

    if (oact) {
        __put_user(old_ka.sa.sa_handler, &oact->sa_handler);
        __put_user(old_ka.sa.sa_flags, &oact->sa_flags);
        __put_user(old_ka.sa.sa_mask.sig[0], &oact->sa_mask);
    }

Notice that the sigaction( ) system call also allows initialization of the sa_flags field in the sigaction table. We listed the values allowed for this field and the related meanings in Table 10-4 (earlier in this chapter).

Older System V Unix variants offered the signal( ) system call, which is still widely used by programmers. Recent C libraries implement signal( ) by means of sigaction( ). However, Linux still supports older C libraries and offers the sys_signal( ) service routine:

    new_sa.sa.sa_handler = handler;
    new_sa.sa.sa_flags = SA_ONESHOT | SA_NOMASK;
    ret = do_sigaction(sig, &new_sa, &old_sa);
    return ret ? ret : (unsigned long)old_sa.sa.sa_handler;

10.4.3 Examining the Pending Blocked Signals

The sigpending( ) system call allows a process to examine the set of pending blocked signals, i.e., those that have been raised while blocked. The corresponding sys_sigpending( ) service routine acts on a single parameter, set, namely, the address of a user variable where the array of bits must be copied:

    spin_lock_irq(&current->sigmask_lock);
    sigandsets(&pending, &current->blocked, &current->pending.signal);
    spin_unlock_irq(&current->sigmask_lock);
    copy_to_user(set, &pending, sizeof(sigset_t));

10.4.4 Modifying the Set of Blocked Signals

The sigprocmask( ) system call allows processes to modify the set of blocked signals; it applies only to regular (non-real-time) signals. The corresponding sys_sigprocmask( ) service routine acts on three parameters:

oset
    Pointer in the process address space to a bit array where the previous bit mask must be stored

set
    Pointer in the process address space to the bit array containing the new bit mask

how
    Flag that may have one of the following values:

    SIG_BLOCK
        The *set bit mask array specifies the signals that must be added to the bit mask array of blocked signals.

    SIG_UNBLOCK
        The *set bit mask array specifies the signals that must be removed from the bit mask array of blocked signals.

    SIG_SETMASK
        The *set bit mask array specifies the new bit mask array of blocked signals.

The function invokes copy_from_user( ) to copy the value pointed to by the set parameter into the new_set local variable and copies the bit mask array of standard blocked signals of current into the old_set local variable. It then acts as the how flag specifies on these two variables:

    if (copy_from_user(&new_set, set, sizeof(*set)))
        return -EFAULT;
    new_set &= ~(sigmask(SIGKILL)|sigmask(SIGSTOP));
    spin_lock_irq(&current->sigmask_lock);
    old_set = current->blocked.sig[0];
    if (how == SIG_BLOCK)
        sigaddsetmask(&current->blocked, new_set);
    else if (how == SIG_UNBLOCK)
        sigdelsetmask(&current->blocked, new_set);
    else if (how == SIG_SETMASK)
        current->blocked.sig[0] = new_set;
    else
        return -EINVAL;
    recalc_sigpending(current);
    spin_unlock_irq(&current->sigmask_lock);
    if (oset) {
        if (copy_to_user(oset, &old_set, sizeof(*oset)))
            return -EFAULT;
    }
    return 0;

10.4.5 Suspending the Process

The sigsuspend( ) system call puts the process in the TASK_INTERRUPTIBLE state, after having blocked the standard signals specified by a bit mask array to which the mask parameter points. The process will wake up only when a nonignored, nonblocked signal is sent to it.

The corresponding sys_sigsuspend( ) service routine executes these statements:

    mask &= ~(sigmask(SIGKILL) | sigmask(SIGSTOP));
    spin_lock_irq(&current->sigmask_lock);
    saveset = current->blocked;
    siginitset(&current->blocked, mask);
    recalc_sigpending(current);
    spin_unlock_irq(&current->sigmask_lock);
    regs->eax = -EINTR;
    while (1) {
        current->state = TASK_INTERRUPTIBLE;
        schedule( );
        if (do_signal(regs, &saveset))
            return -EINTR;
    }

The schedule( ) function selects another process to run. When the process that issued the sigsuspend( ) system call is executed again, sys_sigsuspend( ) invokes the do_signal( ) function to deliver the signal that has woken up the process. If that function returns the value 1, the signal is not ignored. Therefore the system call terminates by returning the error code -EINTR.

The sigsuspend( ) system call may appear redundant, since the combined execution of sigprocmask( ) and sleep( ) apparently yields the same result. But this is not true: because of interleaving of process executions, one must be conscious that invoking a system call to perform action A followed by another system call to perform action B is not equivalent to invoking a single system call that performs action A and then action B.

In the particular case, sigprocmask( ) might unblock a signal that is delivered before invoking sleep( ). If this happens, the process might remain in a TASK_INTERRUPTIBLE state forever, waiting for the signal that was already delivered. On the other hand, the sigsuspend( ) system call does not allow signals to be sent after unblocking and before the schedule( ) invocation because other processes cannot grab the CPU during that time interval.

10.4.6 System Calls for Real-Time Signals

Since the system calls previously examined apply only to standard signals, additional system calls must be introduced to allow User Mode processes to handle real-time signals.

Several system calls for real-time signals (rt_sigaction( ), rt_sigpending( ), rt_sigprocmask( ), and rt_sigsuspend( )) are similar to those described earlier and won't be discussed further. For the same reason, we won't discuss two other system calls that deal with queues of real-time signals:

rt_sigqueueinfo( )
    Sends a real-time signal so that it is added to the pending signal queue of the destination process

rt_sigtimedwait( )
    Dequeues a blocked pending signal without delivering it and returns the signal number to the caller; if no blocked signal is pending, suspends the current process for a fixed amount of time.


Chapter 11. Process Scheduling

Like any time-sharing system, Linux achieves the magical effect of an apparent simultaneous execution of multiple processes by switching from one process to another in a very short time frame. Process switching itself was discussed in Chapter 3; this chapter deals with scheduling, which is concerned with when to switch and which process to choose.

The chapter consists of three parts. Section 11.1 introduces the choices made by Linux to schedule processes in the abstract. Section 11.2 discusses the data structures used to implement scheduling and the corresponding algorithm. Finally, Section 11.3 describes the system calls that affect process scheduling.


11.1 Scheduling Policy

The scheduling algorithm of traditional Unix operating systems must fulfill several conflicting objectives: fast process response time, good throughput for background jobs, avoidance of process starvation, reconciliation of the needs of low- and high-priority processes, and so on. The set of rules used to determine when and how to select a new process to run is called scheduling policy.

Linux scheduling is based on the time-sharing technique already introduced in Section 6.3: several processes run in "time multiplexing" because the CPU time is divided into "slices," one for each runnable process. [1] Of course, a single processor can run only one process at any given instant. If a currently running process is not terminated when its time slice or quantum expires, a process switch may take place. Time-sharing relies on timer interrupts and is thus transparent to processes. No additional code needs to be inserted in the programs to ensure CPU time-sharing.

[1] Recall that stopped and suspended processes cannot be selected by the scheduling algorithm to run on the CPU.

The scheduling policy is also based on ranking processes according to their priority. Complicated algorithms are sometimes used to derive the current priority of a process, but the end result is the same: each process is associated with a value that denotes how appropriate it is to be assigned to the CPU.

In Linux, process priority is dynamic. The scheduler keeps track of what processes are doing and adjusts their priorities periodically; in this way, processes that have been denied the use of the CPU for a long time interval are boosted by dynamically increasing their priority. Correspondingly, processes running for a long time are penalized by decreasing their priority.

When speaking about scheduling, processes are traditionally classified as "I/O-bound" or "CPU-bound." The former make heavy use of I/O devices and spend much time waiting for I/O operations to complete; the latter are number-crunching applications that require a lot of CPU time.

An alternative classification distinguishes three classes of processes:

Interactive processes
    These interact constantly with their users, and therefore spend a lot of time waiting for keypresses and mouse operations. When input is received, the process must be woken up quickly, or the user will find the system to be unresponsive. Typically, the average delay must fall between 50 and 150 milliseconds. The variance of such delay must also be bounded, or the user will find the system to be erratic. Typical interactive programs are command shells, text editors, and graphical applications.

Batch processes
    These do not need user interaction, and hence they often run in the background. Since such processes do not need to be very responsive, they are often penalized by the scheduler. Typical batch programs are programming language compilers, database search engines, and scientific computations.

Real-time processes
    These have very stringent scheduling requirements. Such processes should never be blocked by lower-priority processes and should have a short guaranteed response time with a minimum variance. Typical real-time programs are video and sound applications, robot controllers, and programs that collect data from physical sensors.

The two classifications we just offered are somewhat independent. For instance, a batch process can be either I/O-bound (e.g., a database server) or CPU-bound (e.g., an image rendering program). While real-time programs are explicitly recognized as such by the scheduling algorithm in Linux, there is no way to distinguish between interactive and batch programs. To offer a good response time to interactive applications, Linux (like all Unix kernels) implicitly favors I/O-bound processes over CPU-bound ones.

Programmers may change the scheduling priorities by means of the system calls illustrated in Table 11-1. More details are given in Section 11.3.

Table 11-1. System calls related to scheduling

System call                  Description
nice( )                      Change the priority of a conventional process.
getpriority( )               Get the maximum priority of a group of conventional processes.
setpriority( )               Set the priority of a group of conventional processes.
sched_getscheduler( )        Get the scheduling policy of a process.
sched_setscheduler( )        Set the scheduling policy and priority of a process.
sched_getparam( )            Get the priority of a process.
sched_setparam( )            Set the priority of a process.
sched_yield( )               Relinquish the processor voluntarily without blocking.
sched_get_priority_min( )    Get the minimum priority value for a policy.
sched_get_priority_max( )    Get the maximum priority value for a policy.
sched_rr_get_interval( )     Get the time quantum value for the Round Robin policy.

Most system calls shown in the table apply to real-time processes, thus allowing users to develop real-time applications. However, Linux does not support the most demanding real-time applications because its kernel is nonpreemptive (see Section 11.2.3 later in this chapter).
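As a brief illustration of how a User Mode program might use some of the calls in Table 11-1, the following sketch queries the static-priority range of a policy and reports the policy of the calling process. The helper names rt_priority_range( ) and print_scheduling_info( ) are ours and purely illustrative; the sched_*( ) calls are the standard wrappers declared in <sched.h>. The exact ranges returned depend on the running kernel.

```c
#include <assert.h>
#include <sched.h>
#include <stdio.h>

/* Hypothetical helper: fetch the static-priority range that the kernel
 * reports for a scheduling policy. Returns 0 on success, -1 on failure. */
int rt_priority_range(int policy, int *lo, int *hi)
{
    assert(lo != NULL && hi != NULL);

    int min = sched_get_priority_min(policy);
    int max = sched_get_priority_max(policy);

    if (min == -1 || max == -1)
        return -1;
    *lo = min;
    *hi = max;
    return 0;
}

/* Hypothetical helper: print the caller's policy and the real-time ranges. */
void print_scheduling_info(void)
{
    int lo, hi;
    int policy = sched_getscheduler(0);   /* a PID of 0 denotes the caller */

    printf("current policy: %s\n",
           policy == SCHED_FIFO  ? "SCHED_FIFO"  :
           policy == SCHED_RR    ? "SCHED_RR"    :
           policy == SCHED_OTHER ? "SCHED_OTHER" : "other");

    if (rt_priority_range(SCHED_FIFO, &lo, &hi) == 0)
        printf("SCHED_FIFO static priorities: %d..%d\n", lo, hi);
    if (rt_priority_range(SCHED_RR, &lo, &hi) == 0)
        printf("SCHED_RR static priorities: %d..%d\n", lo, hi);
}
```

Calling print_scheduling_info( ) from an ordinary shell session would normally report a conventional policy, since changing to a real-time policy requires privileges.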

11.1.1 Process Preemption

As mentioned in the first chapter, Linux processes are preemptive. If a process enters the TASK_RUNNING state, the kernel checks whether its dynamic priority is greater than the priority of the currently running process. If it is, the execution of current is interrupted and the scheduler is invoked to select another process to run (usually the process that just became runnable). Of course, a process may also be preempted when its time quantum expires. As mentioned in Section 6.3, when this occurs, the need_resched field of the current process is set, so the scheduler is invoked when the timer interrupt handler terminates.

For instance, let's consider a scenario in which only two programs (a text editor and a compiler) are being executed. The text editor is an interactive program, so it has a higher dynamic priority than the compiler. Nevertheless, it is often suspended, since the user alternates between pauses for think time and data entry; moreover, the average delay between two keypresses is relatively long. However, as soon as the user presses a key, an interrupt is raised and the kernel wakes up the text editor process. The kernel also determines that the dynamic priority of the editor is higher than the priority of current, the currently running process (the compiler), so it sets the need_resched field of this process, thus forcing the scheduler to be activated when the kernel finishes handling the interrupt. The scheduler selects the editor and performs a process switch; as a result, the execution of the editor is resumed very quickly and the character typed by the user is echoed to the screen. When the character has been processed, the text editor process suspends itself waiting for another keypress and the compiler process can resume its execution.

Be aware that a preempted process is not suspended, since it remains in the TASK_RUNNING state; it simply no longer uses the CPU.

Some real-time operating systems feature preemptive kernels, which means that a process running in Kernel Mode can be interrupted after any instruction, just as it can in User Mode. The Linux kernel is not preemptive, which means that a process can be preempted only while running in User Mode; nonpreemptive kernel design is much simpler, since most synchronization problems involving the kernel data structures are easily avoided (see Section 5.2).
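The wake-up check in the editor and compiler scenario can be pictured with a small User Mode model. This is only a sketch under our own assumptions: struct task, its dyn_priority field, and check_preempt( ) are hypothetical names that stand in for the real process descriptor and kernel routines, which are discussed later in the chapter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, stripped-down process descriptor. */
struct task {
    int dyn_priority;   /* larger value = more deserving of the CPU */
    int need_resched;   /* 1 requests a deferred call to the scheduler */
};

/* Model of the check made when `woken` becomes runnable while `running`
 * holds the CPU. Returns 1 if a reschedule was requested, 0 otherwise. */
int check_preempt(struct task *running, const struct task *woken)
{
    assert(running != NULL && woken != NULL);

    if (woken->dyn_priority > running->dyn_priority) {
        /* The switch is not performed here; it is only requested. */
        running->need_resched = 1;
        return 1;
    }
    return 0;
}
```

Note that the model only sets a flag: the actual process switch is deferred until the kernel finishes handling the interrupt, as the text explains.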

11.1.2 How Long Must a Quantum Last?

The quantum duration is critical for system performance: it should be neither too long nor too short.

If the quantum duration is too short, the system overhead caused by process switches becomes excessively high. For instance, suppose that a process switch requires 10 milliseconds; if the quantum is also set to 10 milliseconds, then at least 50 percent of the CPU cycles will be dedicated to process switching. [2]

[2] Actually, things could be much worse than this; for example, if the time required for the process switch is counted in the process quantum, all CPU time is devoted to the process switch and no process can progress toward its termination.
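The 50 percent figure follows from simple arithmetic: if every quantum of length q is followed by a switch that costs s, the fraction of time lost is s / (s + q). The function below is a minimal sketch of that calculation; its name and the assumption that exactly one switch follows each quantum are ours.

```c
#include <assert.h>

/* Fraction of CPU time spent on process switches, assuming each quantum
 * of `quantum_ms` is followed by one switch costing `switch_ms`. */
double switch_overhead(double switch_ms, double quantum_ms)
{
    assert(switch_ms >= 0.0 && quantum_ms > 0.0);
    return switch_ms / (switch_ms + quantum_ms);
}
```

With both values set to 10 milliseconds the function returns 0.5, which matches the example in the text; a longer quantum with the same switch cost gives a smaller fraction.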

If the quantum duration is too long, processes no longer appear to be executed concurrently. For instance, let's suppose that the quantum is set to five seconds; each runnable process makes progress for about five seconds, but then it stops for a very long time (typically, five seconds times the number of runnable processes).

It is often believed that a long quantum duration degrades the response time of interactive applications. This is usually false. As described in Section 11.1.1 earlier in this chapter, interactive processes have a relatively high priority, so they quickly preempt the batch processes, no matter how long the quantum duration is.

In some cases, a quantum duration that is too long degrades the responsiveness of the system. For instance, suppose two users concurrently enter two commands at the respective shell prompts; one command is CPU-bound, while the other is an interactive application. Both shells fork a new process and delegate the execution of the user's command to it; moreover, suppose such new processes have the same priority initially (Linux does not know in advance if an executed program is batch or interactive). Now if the scheduler selects the CPU-bound process to run, the other process could wait for a whole time quantum before starting its execution. Therefore, if such duration is long, the system could appear to be unresponsive to the user who launched the interactive application.

The choice of quantum duration is always a compromise. The rule of thumb adopted by Linux is to choose a duration as long as possible, while keeping good system response time.


11.2 The Scheduling Algorithm

The Linux scheduling algorithm works by dividing the CPU time into epochs. In a single epoch, every process has a specified time quantum whose duration is computed when the epoch begins. In general, different processes have different time quantum durations. The time quantum value is the maximum CPU time portion assigned to the process in that epoch. When a process has exhausted its time quantum, it is preempted and replaced by another runnable process. Of course, a process can be selected several times by the scheduler in the same epoch, as long as its quantum has not been exhausted; for instance, if it suspends itself to wait for I/O, it preserves some of its time quantum and can be selected again during the same epoch. The epoch ends when all runnable processes have exhausted their quanta; in this case, the scheduler algorithm recomputes the time-quantum durations of all processes and a new epoch begins.

Each process has a base time quantum, which is the time-quantum value assigned by the scheduler to the process if it has exhausted its quantum in the previous epoch. The users can change the base time quantum of their processes by using the nice( ) and setpriority( ) system calls (see Section 11.3 later in this chapter). A new process always inherits the base time quantum of its parent.

The INIT_TASK macro sets the value of the initial time quantum of process 0 (swapper) to DEF_COUNTER; that macro is defined as follows:

    #define DEF_COUNTER (10 * HZ / 100)

Since HZ (which denotes the frequency of timer interrupts) is set to 100 for IBM compatible PCs (see Section 6.1.3), the value of DEF_COUNTER is 10 ticks, that is, about 100 ms.
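To make the arithmetic concrete, the sketch below evaluates the macro for HZ equal to 100 and converts ticks to milliseconds. The ticks_to_ms( ) helper is ours, and it assumes that one tick lasts exactly 1000/HZ milliseconds, which is only the nominal value.

```c
#include <assert.h>

#define HZ 100                        /* timer interrupts per second on IBM compatible PCs */
#define DEF_COUNTER (10 * HZ / 100)   /* same definition as in the text */

/* Hypothetical helper: nominal duration of `ticks` timer ticks, in
 * milliseconds, assuming a tick lasts exactly 1000/HZ ms. */
unsigned int ticks_to_ms(unsigned int ticks)
{
    assert(ticks <= 0xFFFFFFFFu / 1000u);   /* keep the product from overflowing */
    return ticks * 1000u / HZ;
}
```

Under these assumptions DEF_COUNTER evaluates to 10 and ticks_to_ms(DEF_COUNTER) to 100.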

To select a process to run, the Linux scheduler must consider the priority of each process. Actually, there are two kinds of priorities:

Static priority
    This is assigned by the users to real-time processes and ranges from 1 to 99. It is never changed by the scheduler.

Dynamic priority
    This applies only to conventional processes; it is essentially the sum of the base time quantum (which is therefore also called the base priority of the process) and of the number of ticks of CPU time left to the process before its quantum expires in the current epoch.

Of course, the static priority of a real-time process is always higher than the dynamic priority of a conventional one. The scheduler starts running conventional processes only when there is no real-time process in a TASK_RUNNING state.
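One way to picture how the two kinds of priorities can be compared on a single scale is sketched below. This is a simplified model, not the kernel's own selection code: struct task, the weight( ) function, and the RT_OFFSET constant are hypothetical, and the offset is simply assumed to be larger than any value a conventional process can reach.

```c
#include <assert.h>
#include <stddef.h>

enum task_kind { CONVENTIONAL, REAL_TIME };

/* Hypothetical, stripped-down process descriptor. */
struct task {
    enum task_kind kind;
    int rt_priority;    /* static priority (1..99), real-time processes only */
    int base_quantum;   /* base time quantum, in ticks */
    int counter;        /* ticks left in the current epoch */
};

/* Assumption: larger than any sum of base quantum and remaining ticks. */
#define RT_OFFSET 1000

/* Returns a value that grows with how deserving the process is of the CPU. */
int weight(const struct task *p)
{
    assert(p != NULL);

    if (p->kind == REAL_TIME)
        return RT_OFFSET + p->rt_priority;   /* always beats conventional processes */
    return p->base_quantum + p->counter;     /* dynamic priority, as described above */
}
```

In this model even the lowest real-time priority outranks a conventional process that still has its whole quantum left.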

There is always at least one runnable process: the swapper kernel thread, which has PID 0 and executes only when the CPU cannot execute other processes. As mentioned in Chapter 3, every CPU of a multiprocessor system has its own kernel thread with PID equal to 0.

11.2.1 Data Structures Used by the Scheduler

Recall from Section 3.2 that the process list links all process descriptors, while the runqueue list links the process descriptors of all runnable processes, that is, of those in a TASK_RUNNING state. In both cases, the init_task process descriptor plays the role of list header.

11.2.1.1 Process descriptor

Each process descriptor includes several fields related to scheduling:

need_resched
    A flag checked by ret_from_sys_call( ) to decide whether to invoke the schedule( ) function (see Section 4.8.3). [3]

    [3] Besides the values 0 (false) and 1 (true), the need_resched field of a swapper kernel thread (PID 0) in a multiprocessor system can also assume the value -1; see Section 11.2.2.6 later in this chapter for details.

policy
    The scheduling class. The values permitted are:

    SCHED_FIFO
        A First-In, First-Out real-time process. When the scheduler assigns the CPU to the process, it leaves the process descriptor in its current position in the runqueue list. If no other higher-priority real-time process is runnable, the process continues to use the CPU as long as it wishes, even if other real-time processes that have the same priority are runnable.

    SCHED_RR
        A Round Robin real-time process. When the scheduler assigns the CPU to the process, it puts the process descriptor at the end of the runqueue list. This policy ensures a fair assignment of CPU time to all SCHED_RR real-time processes that have the same priority.

    SCHED_OTHER
        A conventional, time-shared process.

    The policy field also encodes a SCHED_YIELD binary flag. This flag is set when the process invokes the sched_yield( ) system call (a way of voluntarily relinquishing the processor without the need to start an I/O operation or go to sleep; see Section 11.3 later in this chapter). The kernel also sets the SCHED_YIELD flag and invokes the schedule( ) function whenever it is executing a long noncritical task and wishes to give other processes a chance to run.

rt_priority
    The static priority of a real-time process; valid priorities range between 1 and 99. The static priority of a conventional process must be set to 0.

counter
    The number of ticks of CPU time left to the process before its quantum expires; when a new epoch begins, this field contains the time-quantum duration of the process. Recall that the update_process_times( ) function decrements the counter field of the current process by 1 at every tick.

nice
    Determines the length of the process time quantum when a new epoch begins. This field contains values ranging between -20 and +19; negative values correspond to "high priority" processes, positive ones to "low priority" processes. The default value 0 corresponds to normal processes.

cpus_allowed
    A bit mask specifying the CPUs on which the process is allowed to run. In the 80x86 architecture, the maximum number of processors is set to 32, so the whole mask can be encoded in a single integer field.

cpus_runnable
    A bit mask specifying the CPU that is executing the process, if any. If the process is not executed by any CPU, all bits of the field are set to 1. Otherwise, all bits of the field are set to 0, except the bit associated with the executing CPU, which is set to 1. This encoding allows the kernel to verify whether the process can be scheduled on a given CPU by simply computing the logical AND between this field, the cpus_allowed field, and the bit mask specifying the CPU.
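The three-way AND described for cpus_runnable can be sketched as follows. The struct and the can_schedule( ) helper are a simplified User Mode model that keeps only the two bit masks; the names are chosen for illustration and should not be read as the kernel's exact definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical descriptor holding only the two CPU bit masks. */
struct task {
    unsigned long cpus_allowed;    /* bit n set: may run on CPU n */
    unsigned long cpus_runnable;   /* all ones if idle, else only the executing CPU's bit */
};

/* Nonzero if `p` may be scheduled on CPU number `cpu`. */
int can_schedule(const struct task *p, unsigned int cpu)
{
    assert(p != NULL && cpu < 32);   /* at most 32 CPUs, as stated in the text */
    return (p->cpus_runnable & p->cpus_allowed & (1UL << cpu)) != 0;
}
```

A process that is not running has every bit of cpus_runnable set, so the result depends only on cpus_allowed; a process already executing on some CPU passes the test for that CPU alone.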

processor
    The index of the CPU that is executing the process, if any; otherwise, the index of the last CPU that executed the process.

When a new process is created, do_fork( ) sets the counter field of both current (the parent) and p (the child) processes in the following way:

    p->counter = (current->counter + 1) >> 1;
    current->counter >>= 1;
    if (!current->counter)
        current->need_resched = 1;

In other words, the number of ticks left to the parent is split in two halves: one for the parent and one for the child. This is done to prevent users from getting an unlimited amount of CPU time by using the following method: the parent process creates a child process that runs the same code and then kills itself; by properly adjusting the creation rate, the child process would always get a fresh quantum before the quantum of its parent expires. This programming trick does not work since the kernel does not reward forks. Similarly, a user cannot hog an unfair share of the processor by starting lots of background processes in a shell or by opening a lot of windows on a graphical desktop. More generally speaking, a process cannot hog resources (unless it has privileges to give itself a real-time policy) by forking multiple descendants.
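A quick way to convince yourself that no ticks are created by a fork is to run the same three statements on a toy descriptor. In the sketch below, struct task and split_quantum( ) are hypothetical stand-ins for the process descriptor and for the relevant fragment of do_fork( ).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical descriptor with only the fields touched by the split. */
struct task {
    int counter;        /* ticks left in the current quantum */
    int need_resched;   /* 1 requests a deferred call to the scheduler */
};

/* Split the parent's remaining ticks between parent and child, using the
 * same arithmetic as the fragment shown above. */
void split_quantum(struct task *parent, struct task *child)
{
    assert(parent != NULL && child != NULL && parent->counter >= 0);

    child->counter = (parent->counter + 1) >> 1;   /* child gets the larger half */
    parent->counter >>= 1;                         /* parent keeps the rest */
    if (!parent->counter)
        parent->need_resched = 1;                  /* parent has nothing left */
}
```

With 7 ticks left, the child receives 4 and the parent keeps 3; with a single tick left, the child receives it and the parent is marked for rescheduling. In both cases the two counters add up to the parent's original value.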

11.2.1.2 CPU's data structures

Besides the fields included in each process descriptor, additional information is needed to describe what each CPU is doing. To that end, the scheduler can rely on the aligned_data array of NR_CPUS structures of type schedule_data. Each such structure consists of two fields:

curr
    A pointer to the process descriptor of the process running on that CPU. The field is usually accessed by means of the cpu_curr(n) macro, where n is the CPU logical number.

last_schedule
    The value of the 64-bit Time Stamp Counter when the last process switch was performed on the CPU. The field is usually accessed by means of the last_schedule(n) macro, where n is the CPU logical number.

Most of the time, any CPU accesses only its own array element; it is thus convenient to align the entries of the aligned_data array so that every element falls in a different cache line. In this way, the CPUs have a better chance to find their own element in the hardware cache.
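The idea of giving each CPU its own cache line can be sketched in standard C11 by padding every entry to the line size and aligning the array. The values of NR_CPUS and CACHE_LINE_BYTES, the union aligned_entry type, and the same_cache_line( ) helper are assumptions made for the example; the kernel uses its own architecture-dependent constants and attributes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NR_CPUS 4             /* assumption: small value for illustration */
#define CACHE_LINE_BYTES 64   /* assumption: a common cache line size */

struct task;                  /* opaque process descriptor */

struct schedule_data {
    struct task *curr;                  /* process running on this CPU */
    unsigned long long last_schedule;   /* time stamp of the last switch */
};

/* Pad each entry so that it occupies exactly one cache line. */
union aligned_entry {
    struct schedule_data data;
    char pad[CACHE_LINE_BYTES];
};

/* Align the array itself so that entry n starts at the beginning of a line. */
static _Alignas(CACHE_LINE_BYTES) union aligned_entry aligned_data[NR_CPUS];

/* Nonzero if the entries of CPUs a and b share a cache line. */
int same_cache_line(unsigned int a, unsigned int b)
{
    assert(a < NR_CPUS && b < NR_CPUS);
    uintptr_t line_a = (uintptr_t)&aligned_data[a] / CACHE_LINE_BYTES;
    uintptr_t line_b = (uintptr_t)&aligned_data[b] / CACHE_LINE_BYTES;
    return line_a == line_b;
}
```

Because each entry is both padded and aligned, a CPU that updates its own curr field never invalidates the line that holds another CPU's entry.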

11.2.2 The schedule( ) Function

The schedule( ) function implements the scheduler. Its objective is to find a process in the runqueue list and then assign the CPU to it. It is invoked, directly or in a lazy (deferred) way, by several kernel routines.

11.2.2.1 Direct invocation

The scheduler is invoked directly when the current process must be blocked right away because the resource it needs is not available. In this case, the kernel routine that wants to block it proceeds as follows:

1. Inserts current in the proper wait queue

2. Changes the state of current either to TASK_INTERRUPTIBLE or to TASK_UNINTERRUPTIBLE

3. Invokes schedule( )

4. Checks whether the resource is available; if not, goes to Step 2

5. Once the resource is available, removes current from the wait queue
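The control flow of the five steps can be traced with a small User Mode simulation. Everything here is a stand-in: fake_schedule( ) merely counts invocations and pretends that the process was woken up again, and resource_available( ) reports success after a fixed number of rounds. Real kernel code would use actual wait queues and the real schedule( ) function.

```c
#include <assert.h>
#include <stddef.h>

enum task_state { TASK_RUNNING, TASK_INTERRUPTIBLE };

/* Hypothetical, stripped-down process descriptor. */
struct task {
    enum task_state state;
    int on_wait_queue;
};

static int resource_ready_after = 3;   /* assumption: ready after 3 rounds */
static int schedule_calls;

/* Stand-in for schedule( ): other processes would run here until some
 * kernel routine wakes the sleeper and puts it back in TASK_RUNNING. */
static void fake_schedule(struct task *cur)
{
    schedule_calls++;
    cur->state = TASK_RUNNING;
}

/* Stand-in for the availability test of Step 4. */
static int resource_available(void)
{
    return schedule_calls >= resource_ready_after;
}

/* Follows the five steps listed above; returns how many times the
 * scheduler was invoked before the resource became available. */
int wait_for_resource(struct task *cur)
{
    int rounds = 0;

    assert(cur != NULL);
    cur->on_wait_queue = 1;                /* Step 1 */
    do {
        cur->state = TASK_INTERRUPTIBLE;   /* Step 2 */
        fake_schedule(cur);                /* Step 3 */
        rounds++;
    } while (!resource_available());       /* Step 4 */
    cur->on_wait_queue = 0;                /* Step 5 */
    return rounds;
}
```

The loop shape is the point of the example: the state is set again before every call to the scheduler, and the process leaves the wait queue only after the availability test succeeds.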

As can be seen, the kernel routine checks repeatedly whether the resource needed by the process is available; if not, it yields the CPU to some other process by invoking schedule( ). Later, when the scheduler once again grants the CPU to the process, the availability of the resource is rechecked. These steps are similar to those performed by the sleep_on( ) and interruptible_sleep_on( ) functions described in Section 3.2.4.

The scheduler is also directly invoked by many device drivers that execute long iterative tasks. At each iteration cycle, the driver checks the value of the need_resched field and, if necessary, invokes schedule( ) to voluntarily relinquish the CPU.

11.2.2.2 Lazy invocation

The scheduler can also be invoked in a lazy way by setting the need_resched field of current to 1. Since a check on the value of this field is always made before resuming the execution of a User Mode process (see Section 4.8), schedule( ) will definitely be invoked at some time in the near future.

For instance, lazy invocation of the scheduler is performed in the following cases:

● When current has used up its quantum of CPU time; this is done by the update_process_times( ) function.

● When a process is woken up and its priority is higher than that of the current process; this task is performed by the reschedule_idle( ) function, which is usually invoked by the wake_up_process( ) function (see Section 3.2.2).

● When a sched_setscheduler( ) or sched_yield( ) system call is issued (see Section 11.3 later in this chapter).

11.2.2.3 Actions performed by schedule( ) before a process switch

The goal of the schedule( ) function consists of replacing the currently executing process with another one. Thus, the key outcome of the function is to set a local variable called next so that it points to the descriptor of the process selected to replace current. If no runnable process in the system has priority greater than the priority of current, at the end, next coincides with current and no process switch takes place.

For efficiency reasons, the schedule( ) function starts by initializing a few local variables:

    prev = current;
    this_cpu = prev->processor;
    sched_data = &aligned_data[this_cpu];

As you see, the pointer returned by current is saved in prev, the logical number of the executing CPU is saved in this_cpu, and the pointer to the aligned_data array element of the CPU is saved in sched_data.

Next, schedule( ) makes sure that prev doesn't hold the global kernel lock or the global interrupt lock (see Section 5.5.2 and Section 5.3.10), and then reenables the local interrupts:

    if (prev->lock_depth >= 0)
        spin_unlock(&kernel_flag);
    release_irqlock(this_cpu);
    __sti( );

Generally speaking, a process should never hold a lock across a process switch; otherwise, the system freezes as soon as another process tries to acquire the same lock. However, notice that schedule( ) doesn't change the value of the lock_depth field; when prev resumes its execution, it reacquires the kernel_flag spin lock if the value of this field is not negative. Thus, the global kernel lock is automatically released and reacquired across a process switch. Conversely, the global interrupt lock is not automatically reacquired.

Before starting to look at the runnable processes, schedule( ) must disable the local interrupts and acquire the spin lock that protects the runqueue (see Section 3.2.2.5):

spin_lock_irq(&runqueue_lock); A ch e ck is t h e n m a d e t o d e t e rm in e wh e t h e r prev is a Ro u n d Ro b in re a l- t im e p ro ce s s ( policy fie ld s e t t o

SCHED_RR) t h a t h a s e xh a u s t e d it s q u a n t u m . If s o , schedule( ) a s s ig n s a n e w q u a n t u m t o prev a n d p u t s it a t t h e b o t t o m o f t h e ru n q u e u e lis t :

    if (prev->policy == SCHED_RR && !prev->counter) {
        prev->counter = (20 - prev->nice) / 4 + 1;
        move_last_runqueue(prev);
    }

Recall that the nice field of a process ranges between -20 and +19; therefore, schedule( ) replenishes the counter field with a number of ticks ranging from 11 to 1. The default value of the nice field is 0, so usually the process gets a new quantum of 6 ticks, roughly 60 ms.[4]

[4] Recall that in the 80x86 architecture, 1 tick corresponds to roughly 10 ms (see Section 6.1.3). In all architectures, however, the formula that computes the number of ticks in a quantum is adapted so the default quantum has an order of magnitude of 50 ms.
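To make the arithmetic above concrete, here is a minimal sketch of the quantum computation. The function names are our own and do not appear in the kernel; the conversion to milliseconds assumes the 80x86 value of roughly 10 ms per tick quoted in the footnote.

```c
#include <stdio.h>

/* Ticks granted to a SCHED_RR process that has used up its quantum,
 * following the expression shown in the text: (20 - nice) / 4 + 1.
 * The name rr_quantum_ticks is hypothetical. */
int rr_quantum_ticks(int nice)
{
    return (20 - nice) / 4 + 1;   /* integer division, as in the kernel code */
}

/* Approximate duration in milliseconds, assuming 1 tick is about 10 ms. */
int rr_quantum_ms(int nice)
{
    return rr_quantum_ticks(nice) * 10;
}

/* Print the quantum for the two extreme nice values and for the default. */
void print_quantum_table(void)
{
    const int samples[] = { -20, 0, 19 };
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; i++)
        printf("nice %3d -> %2d ticks (about %d ms)\n", samples[i],
               rr_quantum_ticks(samples[i]), rr_quantum_ms(samples[i]));
}
```

Because the division is performed on integers, a nice value of +19 yields (20 - 19) / 4 = 0, so the process still receives the minimum quantum of 1 tick.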

Next, schedule( ) examines the state of prev. If it has nonblocked pending signals and its state is TASK_INTERRUPTIBLE, the function sets the process state to TASK_RUNNING. This action is not the same as assigning the processor to prev; it just gives prev a chance to be selected for execution:

    if (prev->state == TASK_INTERRUPTIBLE && signal_pending(prev))
        prev->state = TASK_RUNNING;

If prev is not in the TASK_RUNNING state, schedule( ) was directly invoked by the process itself because it had to wait on some external resource; therefore, prev must be removed from the runqueue list:

    if (prev->state != TASK_RUNNING)
        del_from_runqueue(prev);

The function also resets the need_resched field of current, just in case the scheduler was activated in the lazy way:

    prev->need_resched = 0;

Now the time has come for schedule( ) to select the process to be executed in the next time quantum. To that end, the function scans the runqueue list. The objective is to store in next the process descriptor pointer of the highest priority process:

    repeat_schedule:
    next = init_tasks[this_cpu];
    c = -1000;
    list_for_each(tmp, &runqueue_head) {
        p = list_entry(tmp, struct task_struct, run_list);
        if (p->cpus_runnable & p->cpus_allowed & (1 [...] active_mm);
            if (weight > c)
                c = weight, next = p;
        }
    }

The function initializes next so it points to the process referenced by init_tasks[this_cpu], that is, to the process (swapper) associated with the executing CPU; the c local variable is set to -1000. As we shall see in the later Section 11.2.2.5, the goodness( ) function returns an integer that denotes the priority of the process passed as parameter.

While scanning processes in the runqueue, schedule( ) considers only those that are both:

1. Runnable on the executing CPU (cpus_allowed & (1 [...]

    [...] 1) + (20 - p->nice) / 4 + 1;
    read_unlock(&tasklist_lock);
    spin_lock_irq(&runqueue_lock);
    goto repeat_schedule;
    }

In this way, suspended or stopped processes have their dynamic priorities periodically increased. As stated earlier, the rationale for increasing the counter value of suspended or stopped processes is to give preference to I/O-bound processes. However, no matter how often the quantum is increased, its value can never become greater than about 230 ms.[5]

[5] For the mathematically inclined, here is a sketch of the proof: when a new epoch starts, the value of counter is bounded by half of the previous value of counter plus P, which is the maximum value that can be added to counter. If nice is set to -20, then P is equal to 11 ticks. Solving the recurrence equation yields as upper bound the geometric series P × (1 + 1/2 + 1/4 + 1/8 + ...), which converges to 2 × P (that is, 22 ticks).
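The bound in the footnote can also be checked numerically. The following sketch iterates the recurrence "half of the previous counter plus P" for a process that never consumes its quantum. The function names are hypothetical, and the right shift stands in for the halving step described above.

```c
/* One epoch update: half of the old counter plus the maximum increment
 * p_max (the quantity called P in the footnote). */
int next_epoch_counter(int counter, int p_max)
{
    return (counter >> 1) + p_max;
}

/* Value of counter after `epochs` consecutive epochs, starting from zero,
 * for a process that stays suspended and never uses any tick. */
int counter_after_epochs(int epochs, int p_max)
{
    int counter = 0;
    for (int i = 0; i < epochs; i++)
        counter = next_epoch_counter(counter, p_max);
    return counter;
}
```

With P equal to 11 ticks, the sequence is 11, 16, 19, 20, 21, 21, ...: integer arithmetic makes it settle at 21 ticks, just below the 2 × P limit of the geometric series.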

Let's assume now that schedule( ) has selected its best candidate, and that next points to its process descriptor. The function updates the aligned_data array element of the executing CPU (this element is referenced by the sched_data local variable), writes the index of the executing CPU in next's process descriptor, releases the runqueue list spin lock, and reenables local interrupts:

    sched_data->curr = next;
    next->processor = this_cpu;
    next->cpus_runnable = 1UL [...] policy &= ~SCHED_YIELD;
        if (prev->lock_depth >= 0)
            spin_lock(&kernel_flag);
        return;
    }

Notice that schedule( ) reacquires the global kernel lock if the lock_depth field of the process is not negative, as we anticipated when we described the first actions of the function.

If a process other than prev is selected, a process switch must take place. The current value of the Time Stamp Counter, fetched by means of the rdtsc assembly language instruction, is stored in the last_schedule field of the aligned_data array element of the executing CPU:

    asm volatile("rdtsc" : "=A" (sched_data->last_schedule));

The context_swtch field of kstat is also increased by 1 to update the statistics maintained by the kernel:

    kstat.context_swtch++;

It is also crucial to set up the address space of next properly. We know from Chapter 8 that the active_mm field of the process descriptor points to the memory descriptor that is effectively used by the process, while the mm field points to the memory descriptor owned by the process. For normal processes, the two fields hold the same address; however, a kernel thread does not have its own address space and its mm field is always set to NULL. The schedule( ) function ensures that if next is a kernel thread, then it uses the address space used by prev:

    if (!next->mm) {
        next->active_mm = prev->active_mm;
        atomic_inc(&prev->active_mm->mm_count);
        cpu_tlbstate[this_cpu].state = TLBSTATE_LAZY;
    }

In earlier versions of Linux, kernel threads had their own address space. That design choice was suboptimal when the scheduler selected a kernel thread as a new process to run because changing the Page Tables was useless; since any kernel thread runs in Kernel Mode, it uses only the fourth gigabyte of the linear address space, whose mapping is the same for all processes in the system. Even worse, writing into the cr3 register invalidates all TLB entries (see Section 2.4.8), which leads to a significant performance penalty. Linux 2.4 is much more efficient because Page Tables aren't touched at all if next is a kernel thread. As a further optimization, if next is a kernel thread, the schedule( ) function sets the process into lazy TLB mode (see Section 2.4.8).

Conversely, if next is a regular process, the schedule( ) function replaces the address space of prev with the one of next:

    if (next->mm)
        switch_mm(prev->active_mm, next->mm, next, this_cpu);

If prev is a kernel thread, the schedule( ) function releases the address space used by prev and resets prev->active_mm:

    if (!prev->mm) {
        mmdrop(prev->active_mm);
        prev->active_mm = NULL;
    }

Recall that mmdrop( ) decrements the usage counter of the memory descriptor; if the counter reaches 0, it also frees the descriptor together with the associated Page Tables and virtual memory regions.

Now schedule( ) can finally invoke switch_to( ) to perform the process switch between prev and next (see Section 3.3.3):

    switch_to(prev, next, prev);

11.2.2.4 Actions performed by schedule( ) after a process switch

The instructions of the schedule( ) function following the switch_to macro invocation will not be performed right away by the next process, but at a later time by prev when the scheduler selects it again for execution. However, at that moment, the prev local variable does not point to our original process that was to be replaced when we started the description of schedule( ), but rather to the process that was replaced by our original prev when it was scheduled again. (If you are confused, go back and read Section 3.3.3.) The last instructions of the schedule( ) function are:

    __schedule_tail(prev);
    if (prev->lock_depth >= 0)
        spin_lock(&kernel_flag);
    if (current->need_resched)
        goto need_resched_back;
    return;

As you see, schedule( ) invokes __schedule_tail( ) (described next), reacquires the global kernel lock if necessary, and checks whether some other process has set the need_resched field of the resumed process (current) while it was not running. In this case, the whole schedule( ) function is reexecuted from the beginning; otherwise, the function terminates.

In uniprocessor systems, the __schedule_tail( ) function limits itself to clearing the SCHED_YIELD flag of the policy field of prev. Conversely, in multiprocessor systems, the function executes code that is essentially equivalent to the following fragment:

    policy = prev->policy;
    prev->policy = policy & ~SCHED_YIELD;
    wmb( );
    spin_lock(&prev->alloc_lock);
    prev->cpus_runnable = ~0UL;
    spin_lock_irqsave(&runqueue_lock, flags);
    if (prev->state == TASK_RUNNING &&
        prev != init_tasks[smp_processor_id( )] &&
        prev->cpus_runnable == ~0UL &&
        !(policy & SCHED_YIELD))
        reschedule_idle(prev);
    spin_unlock_irqrestore(&runqueue_lock, flags);
    spin_unlock(&prev->alloc_lock);

The wmb( ) memory barrier is used to make sure that the processor won't reshuffle the assembly language instructions that modify the policy field with those that acquire the alloc_lock spin lock (see Section 5.3.2).

As you may notice, the role of __schedule_tail( ) is far more important in multiprocessor systems because this function checks whether the process that was replaced can be rescheduled on some other CPU. This attempt is performed only if the following conditions are satisfied:

● prev is in TASK_RUNNING state.
● prev is not the swapper process of the executing CPU.
● The SCHED_YIELD flag of prev->policy was not set.
● prev was not already selected by another CPU in the time frame elapsed between the assignment to the cpus_runnable field and the if statement (the if statement itself is protected by the runqueue_lock spin lock; see the code shown in the previous section).
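The four conditions listed above can be collected into a single predicate. The sketch below is only an illustration: the structure, the is_swapper flag, and the function name are our own simplifications, the constant values are assumptions, and the locking that protects the test in the kernel is omitted.

```c
#include <stdbool.h>

/* Illustrative constants; the numeric values are assumptions. */
#define TASK_RUNNING 0
#define SCHED_YIELD  0x10

/* Simplified stand-in for the fields of the process descriptor that
 * the multiprocessor version of __schedule_tail( ) inspects. */
struct tail_task {
    long state;
    unsigned long cpus_runnable;  /* ~0UL: not currently bound to a CPU */
    bool is_swapper;              /* replaces the comparison with init_tasks[] */
};

/* Returns true when the replaced process may be offered to another CPU.
 * saved_policy is the policy value read before SCHED_YIELD was cleared. */
bool may_reschedule_elsewhere(const struct tail_task *prev,
                              unsigned long saved_policy)
{
    return prev->state == TASK_RUNNING &&      /* still wants the CPU       */
           !prev->is_swapper &&                /* not the idle process      */
           prev->cpus_runnable == ~0UL &&      /* not picked by another CPU */
           !(saved_policy & SCHED_YIELD);      /* did not yield voluntarily */
}
```

Note that the test uses the policy value saved before the SCHED_YIELD flag was cleared; testing prev->policy after the clearing would always find the flag unset.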

To check whether the priority of prev is high enough to replace the current process of some other CPU, __schedule_tail( ) invokes reschedule_idle( ). This is the same function invoked by wake_up_process( ) and is described in the later Section 11.2.2.6.

The next two sections complete the analysis of the scheduler. They describe, respectively, the goodness( ) and reschedule_idle( ) functions.

11.2.2.5 How good is a runnable process?

The heart of the scheduling algorithm includes identifying the best candidate among all processes in the runqueue list. This is what the goodness( ) function does. It receives as input parameters:

● p, the descriptor pointer of the candidate process
● this_cpu, the logical number of the executing CPU
● this_mm, the memory descriptor address of the process being replaced

The integer value weight returned by goodness( ) measures the "goodness" of p and has the following meanings:

weight = -1

p is the prev process, and its SCHED_YIELD flag is set. The process will be selected only if no other runnable processes (besides the swapper processes) are included in the runqueue.

weight = 0

p is a conventional process that has exhausted its quantum (p->counter is zero). Unless all runnable processes have also exhausted their quantum, it will not be selected for execution.

2 <= weight <= 77

p is a conventional process that has not exhausted its quantum. The weight is computed as follows:

    weight = p->counter + 20 - p->nice;
    if (p->processor == this_cpu)
        weight += 15;
    if (p->mm == this_mm || !p->mm)
        weight += 1;

In multiprocessor systems, a large bonus (+15) is given to the process if it was last running on the CPU that is executing the scheduler. The bonus helps in reducing the number of transfers of processes across several CPUs during their executions, thus yielding a smaller number of hardware cache misses. The function also gives a small bonus (+1) to the process if it is a kernel thread or it shares the memory address space with the previously running process. Again, the process is favored mainly because the TLBs need not be invalidated by writing into the cr3 register.

weight >= 1000

p is a real-time process. The weight is given by p->counter + 1000.

11.2.2.6 Scheduling on multiprocessor systems

With respect to earlier versions, the Linux 2.4 scheduling algorithm has been improved to enhance its performance on multiprocessor systems. It was also simplified, which is a great improvement by itself.

As we have seen, each processor runs the schedule( ) function on its own to replace the process that is currently in execution. However, processors are able to exchange information to boost system performance. In particular, right after a process switch, any processor usually checks whether the just replaced process should be executed on some other CPU running a lower priority process. This check is performed by reschedule_idle( ).

The reschedule_idle( ) function looks for some other CPU to run the process p passed as parameter and uses interprocessor interrupts to force other CPUs to perform scheduling. The function performs a series of tests in a fixed order. If one of them is successful, the function sends a RESCHEDULE_VECTOR interprocessor interrupt to the selected CPU (see Section 4.6.2) and returns. If all tests fail, the function returns without forcing a rescheduling. The tests are performed in the following order:

1. Is the CPU that was last running p (i.e., the one having index p->processor) currently idle?

    best_cpu = p->processor;
    if ((p->cpus_allowed & p->cpus_runnable & (1 [...] need_resched;
        init_tasks[best_cpu]->need_resched = 1;
        if (best_cpu != smp_processor_id( ) && !need_resched)
            smp_send_reschedule(best_cpu);
    }

This is the best possible case because no process is to be preempted and the hardware cache of the processor is warm (filled with useful data). Notice that this case cannot happen when reschedule_idle( ) is invoked by the scheduler because schedule( ) never replaces a runnable process with the swapper kernel thread. This case may happen, however, when reschedule_idle( ) is invoked by wake_up_process( ), that is, when p has just been woken up.

To force the rescheduling on the target processor, the need_resched field of the swapper kernel thread is set. If the target processor is different from the one executing the reschedule_idle( ) function, a RESCHEDULE_VECTOR interprocessor interrupt is also raised. In fact, the idle processor usually executes a hlt assembly language instruction to save power, and it can be woken up only by an interrupt. It is also possible, however, to let the swapper kernel thread actively poll the need_resched field, waiting for its value to change from -1 to +1, in order to speed up the rescheduling and avoid the interprocessor interrupt. This much more power-consuming algorithm can be activated by passing the "idle=poll" parameter to the kernel in the booting phase.

2. Does an idle processor exist that can execute p?

    oldest_idle = -1;
    for (cpu=0; cpu[...]cpus_allowed & p->cpus_runnable & (1 [...] processor;
        goto send_now_idle;
    }

Among the idle processors that can execute p, the function selects the least recently active. Recall that the Time Stamp Counter value of the last process switch of every CPU is stored in the aligned_data array (see the earlier Section 11.2.1). Once the function finds the oldest idle CPU, the function jumps to the code already described in the previous case to force a rescheduling. The rationale behind the "oldest idle rule" is that this CPU is likely to have the greatest number of invalid hardware cache lines.

3. Does there exist a processor that can execute p and whose current process has lower dynamic priority than p?

    max_prio = 0;
    for (cpu=0; cpu[...]cpus_allowed & p->cpus_runnable & (1 [...] active_mm)
                   - goodness(cpu_curr(cpu), cpu, cpu_curr(cpu)->active_mm);
        if (prio > max_prio)
            max_prio = prio, target_tsk = cpu_curr(cpu);
    }

    if (max_prio > 0) {
        target_tsk->need_resched = 1;
        if (target_tsk->processor != smp_processor_id( ))
            smp_send_reschedule(target_tsk->processor);
    }

reschedule_idle( ) finds the processor for which the difference between the goodness of replacing the current process with p and the goodness of replacing the current process with the current process itself is maximum. The function forces a rescheduling on the corresponding processor if the maximum is a positive value. Notice that the function doesn't simply look at the counter and nice fields of the processes; rather, it uses goodness( ), which takes into consideration the cost of replacing the currently running process with another process that potentially uses a different address space.
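The comparison described above can be sketched by combining the weight computation of Section 11.2.2.5 with the goodness difference. This is a simplified illustration and not the kernel code: the structures, the yielded flag, and the names goodness_sketch and preemption_gain are our own, and real-time processes are deliberately left out.

```c
#include <stddef.h>

struct mm_sketch { int unused; };   /* placeholder for a memory descriptor */

/* Simplified stand-in for the process descriptor fields used here. */
struct good_task {
    int counter;                    /* ticks left in the quantum           */
    int nice;                       /* -20 (highest) .. +19 (lowest)       */
    int processor;                  /* CPU on which the process last ran   */
    const struct mm_sketch *mm;     /* NULL for a kernel thread            */
    int yielded;                    /* nonzero if SCHED_YIELD is set       */
};

/* Weight of a conventional process, following the rules in the text. */
int goodness_sketch(const struct good_task *p, int this_cpu,
                    const struct mm_sketch *this_mm)
{
    int weight;

    if (p->yielded)
        return -1;                  /* selected only as a last resort      */
    if (p->counter == 0)
        return 0;                   /* quantum exhausted                   */
    weight = p->counter + 20 - p->nice;
    if (p->processor == this_cpu)
        weight += 15;               /* warm hardware cache on this CPU     */
    if (p->mm == this_mm || p->mm == NULL)
        weight += 1;                /* no address space switch needed      */
    return weight;
}

/* Gain obtained by running p instead of curr on the given CPU; a positive
 * value suggests that forcing a rescheduling there is worthwhile. */
int preemption_gain(const struct good_task *curr, const struct good_task *p,
                    int cpu, const struct mm_sketch *curr_active_mm)
{
    return goodness_sketch(p, cpu, curr_active_mm)
         - goodness_sketch(curr, cpu, curr_active_mm);
}
```

Because the current process of a CPU always collects the +15 and +1 bonuses on that CPU, a candidate coming from elsewhere needs a noticeably larger counter or a lower nice value before the difference becomes positive.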

The Hyper-threading Technology

Very recently, Intel introduced the hyper-threading technology. Basically, a hyper-threaded CPU is a microprocessor that executes two threads of execution at once; it includes several copies of the internal registers and quickly switches between them. Thanks to this approach, the machine cycles spent when one thread is accessing the RAM can be exploited by the second thread. A hyper-threaded CPU is seen by the kernel as two different CPUs, so Linux does not have to be explicitly made aware of it. However, Linux breaks the "oldest idle rule" and forces an immediate rescheduling when it discovers that a hyper-threaded CPU is running two idle processes.

11.2.3 Performance of the Scheduling Algorithm

The scheduling algorithm of Linux is both self-contained and relatively easy to follow. For this reason, many kernel hackers love to try to make improvements. However, the scheduler is a rather mysterious component of the kernel. While you can change its performance significantly by modifying just a few key parameters, there is usually no theoretical support to justify the results obtained. Furthermore, you can't be sure that the positive (or negative) results obtained will continue to hold when the mix of requests submitted by the users (real-time, interactive, I/O-bound, background, etc.) varies significantly. Actually, for almost every proposed scheduling strategy, it is possible to derive an artificial mix of requests that yields poor system performance.

Let's try to outline some pitfalls of the Linux 2.4 scheduler. As it turns out, some of these limitations become significant on large systems with many users. On a single workstation that is running a few tens of processes at a time, the Linux 2.4 scheduler is quite efficient.

11.2.3.1 The algorithm does not scale well

If the number of existing processes is very large, it is inefficient to recompute all dynamic priorities at once.

In old traditional Unix kernels, the dynamic priorities were recomputed every second, so the problem was even worse. Linux tries instead to minimize the overhead of the scheduler. Priorities are recomputed only when all runnable processes have exhausted their time quantum. Therefore, when the number of processes is large, the recomputation phase is more expensive but executed less frequently.

This simple approach has a disadvantage: when the number of runnable processes is very large, I/O-bound processes are seldom boosted, and therefore, interactive applications have a longer response time.

11.2.3.2 The predefined quantum is too large for high system loads

The system responsiveness experienced by users depends heavily on the system load, which is the average number of processes that are runnable and thus waiting for CPU time.[6]

[6] The uptime program returns the system load for the past 1, 5, and 15 minutes. The same information can be obtained by reading the /proc/loadavg file.

As mentioned before, system responsiveness also depends on the average time-quantum duration of the runnable processes. In Linux, the predefined time quantum appears to be too large for high-end machines that have a very high expected system load.
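A back-of-the-envelope estimate shows why the quantum matters under load. The sketch below assumes, purely for illustration, that all runnable processes have equal priority, that each one consumes its full quantum, and that they are served in strict rotation; none of these assumptions holds exactly for the real scheduler, and the function name is our own.

```c
/* Rough worst-case delay, in milliseconds, before a process that has just
 * become runnable receives the CPU, under the simplifying assumptions
 * stated in the text (equal priorities, full quanta, strict rotation). */
int worst_case_wait_ms(int runnable, int quantum_ticks, int ms_per_tick)
{
    if (runnable <= 1)
        return 0;                   /* nobody else is competing */
    return (runnable - 1) * quantum_ticks * ms_per_tick;
}
```

With the default quantum of 6 ticks and 10 ms per tick, two runnable processes give a delay of at most 60 ms, whereas fifty runnable processes push the same estimate to almost 3 seconds.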

11.2.3.3 I/O-bound process boosting strategy is not optimal

The preference for I/O-bound processes is a good strategy to ensure a short response time for interactive programs, but it is not perfect. Indeed, some batch programs with almost no user interaction are I/O-bound. For instance, consider a database search engine that must typically read lots of data from the hard disk, or a network application that must collect data from a remote host on a slow link. Even if these kinds of processes do not need a short response time, they are boosted by the scheduling algorithm.

On the other hand, interactive programs that are also CPU-bound may appear unresponsive to the users, since the increment of dynamic priority due to I/O blocking operations may not compensate for the decrement due to CPU usage.

11.2.3.4 Support for real-time applications is weak

As stated in the first chapter, nonpreemptive kernels are not well suited for real-time applications, since processes may spend several milliseconds in Kernel Mode while handling an interrupt or exception. During this time, a real-time process that becomes runnable cannot be resumed. This is unacceptable for real-time applications, which require predictable and low response times.[7]

[7] The Linux kernel has been modified in several ways so it can handle a few hard real-time jobs if they remain short. Basically, hardware interrupts are trapped and kernel execution is monitored by a kind of "superkernel." These changes do not make Linux a true real-time system, though.

Future versions of Linux might address this problem, by either implementing SVR4's "fixed preemption points" or making the kernel fully preemptive. It remains questionable, however, whether these design choices are appropriate for a general-purpose operating system such as Linux.

Kernel preemption, in fact, is just one of several necessary conditions for implementing an effective real-time scheduler. Several other issues must be considered. For instance, real-time processes often must use resources that are also needed by conventional processes. A real-time process may thus end up waiting until a lower-priority process releases some resource. This phenomenon is called priority inversion. Moreover, a real-time process could require a kernel service that is granted on behalf of another lower-priority process (for example, a kernel thread). This phenomenon is called hidden scheduling. An effective real-time scheduler should address and resolve such problems.

Luckily, all these shortcomings have been fixed in the brand new scheduler developed by Ingo Molnar that is included in the Linux 2.5 current development version. This scheduler is so efficient that it has been back-ported to Linux 2.4 and adopted by some commercial distributions of the GNU/Linux system.


11.3 System Calls Related to Scheduling

Several system calls have been introduced to allow processes to change their priorities and scheduling policies. As a general rule, users are always allowed to lower the priorities of their processes. However, if they want to modify the priorities of processes belonging to some other user or if they want to increase the priorities of their own processes, they must have superuser privileges.

11.3.1 The nice( ) System Call

The nice( )[8] system call allows processes to change their base priority. The integer value contained in the increment parameter is used to modify the nice field of the process descriptor. The nice Unix command, which allows users to run programs with modified scheduling priority, is based on this system call.

[8] Since this system call is usually invoked to lower the priority of a process, users who invoke it for their processes are "nice" to other users.

The sys_nice( ) service routine handles the nice( ) system call. Although the increment parameter may have any value, absolute values larger than 40 are trimmed down to 40. Traditionally, negative values correspond to requests for priority increments and require superuser privileges, while positive ones correspond to requests for priority decrements. In the case of a negative increment, the function invokes the capable( ) function to verify whether the process has a CAP_SYS_NICE capability. We discuss that function, together with the notion of capability, in Chapter 20. If the user turns out to have the capability required to change priorities, sys_nice( ) adds the value of increment to the nice field of current. If necessary, the value of this field is trimmed down so it won't be less than -20 or greater than +19.

The nice( ) system call is maintained for backward compatibility only; it has been replaced by the setpriority( ) system call described next.
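The steps performed by sys_nice( ) can be summarized by the following user-space sketch. It is not the kernel routine: the function name, the has_cap_sys_nice flag (which stands in for the result of the capable( ) check), and the EPERM_SKETCH constant are our own assumptions.

```c
#include <stdbool.h>

#define EPERM_SKETCH 1              /* stand-in for the kernel's EPERM value */

/* Applies increment to *nice as described in the text.
 * Returns 0 on success or -EPERM_SKETCH if a priority raise is requested
 * without the required capability; *nice is left untouched on failure. */
int nice_sketch(int *nice, int increment, bool has_cap_sys_nice)
{
    int newnice;

    if (increment < 0) {
        if (!has_cap_sys_nice)
            return -EPERM_SKETCH;   /* raising priority needs CAP_SYS_NICE */
        if (increment < -40)
            increment = -40;        /* trim large negative requests        */
    }
    if (increment > 40)
        increment = 40;             /* trim large positive requests        */

    newnice = *nice + increment;
    if (newnice < -20)
        newnice = -20;              /* highest priority allowed            */
    if (newnice > 19)
        newnice = 19;               /* lowest priority allowed             */
    *nice = newnice;
    return 0;
}
```

For example, starting from a nice value of 0, an unprivileged increment of 5 succeeds, while an unprivileged increment of -10 is refused and leaves the value unchanged.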

11.3.2 The getpriority( ) and setpriority( ) System Calls

The nice( ) system call affects only the process that invokes it. Two other system calls, denoted as getpriority( ) and setpriority( ), act on the base priorities of all processes in a given group. getpriority( ) returns 20 minus the lowest nice field value among all processes in a given group (that is, the highest priority among those processes); setpriority( ) sets the base priority of all processes in a given group to a given value.

The kernel implements these system calls by means of the sys_getpriority( ) and sys_setpriority( ) service routines. Both of them act essentially on the same group of parameters:

which
    The value that identifies the group of processes; it can assume one of the following:

    PRIO_PROCESS
        Selects the processes according to their process ID (pid field of the process descriptor).

    PRIO_PGRP
        Selects the processes according to their group ID (pgrp field of the process descriptor).

    PRIO_USER
        Selects the processes according to their user ID (uid field of the process descriptor).

who
    The value of the pid, pgrp, or uid field (depending on the value of which) to be used for selecting the processes. If who is 0, its value is set to that of the corresponding field of the current process.

niceval
    The new base priority value (needed only by sys_setpriority( )). It should range between -20 (highest priority) and +19 (lowest priority).

As stated before, only processes with a CAP_SYS_NICE capability are allowed to increase their own base priority or to modify that of other processes.

As we saw in Chapter 9, system calls return a negative value only if some error occurred. For this reason, getpriority( ) does not return a normal nice value ranging between -20 and +19, but rather a nonnegative value ranging between 1 and 40.
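The "20 minus nice" encoding is easy to check with a small model. The sketch below is illustrative only: the function names are hypothetical, the group of processes is reduced to a plain array of nice values, and returning -ESRCH for an empty group is our assumption about how "no matching process" would be reported.

```c
#include <errno.h>

/* Encode a nice value (-20..+19) as a nonnegative value (40..1). */
int nice_to_retval(int nice)
{
    return 20 - nice;
}

/* Decode the nonnegative value back into a nice value. */
int retval_to_nice(int retval)
{
    return 20 - retval;
}

/* Model of "20 minus the lowest nice value among all processes in a group".
 * An empty group yields -ESRCH (an assumption of this sketch). */
int group_getpriority(const int *nice_values, int count)
{
    int i, lowest;

    if (count <= 0)
        return -ESRCH;

    lowest = nice_values[0];
    for (i = 1; i < count; i++)
        if (nice_values[i] < lowest)
            lowest = nice_values[i];

    return nice_to_retval(lowest);
}
```

Since every valid result lies between 1 and 40, a negative return value can be reserved unambiguously for errors.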

11.3.3 System Calls Related to Real-Time Processes

We now introduce a group of system calls that allow processes to change their scheduling discipline and, in particular, to become real-time processes. As usual, a process must have a CAP_SYS_NICE capability to modify the values of the rt_priority and policy process descriptor fields of any process, including itself.

11.3.3.1 The sched_getscheduler( ) and sched_setscheduler( ) system calls

The sched_getscheduler( ) system call queries the scheduling policy currently applied to the process identified by the pid parameter. If pid equals 0, the policy of the calling process is retrieved. On success, the system call returns the policy for the process: SCHED_FIFO, SCHED_RR, or SCHED_OTHER. The corresponding sys_sched_getscheduler( ) service routine invokes find_process_by_pid( ), which locates the process descriptor corresponding to the given pid and returns the value of its policy field.

The sched_setscheduler( ) system call sets both the scheduling policy and the associated parameters for the process identified by the parameter pid. If pid is equal to 0, the scheduler parameters of the calling process will be set. The corresponding sys_sched_setscheduler( ) function checks whether the scheduling policy specified by the policy parameter and the new static priority specified by the param->sched_priority parameter are valid. It also checks whether the process has CAP_SYS_NICE capability or whether its owner has superuser rights. If everything is OK, it executes the following statements:

    p->policy = policy;
    p->rt_priority = param->sched_priority;
    if (task_on_runqueue(p))
        move_first_runqueue(p);
    current->need_resched = 1;

11.3.3.2 The sched_getparam( ) and sched_setparam( ) system calls

The sched_getparam( ) system call retrieves the scheduling parameters for the process identified by pid. If pid is 0, the parameters of the current process are retrieved. The corresponding sys_sched_getparam( ) service routine, as one would expect, finds the process descriptor pointer associated with pid, stores its rt_priority field in a local variable of type sched_param, and invokes copy_to_user( ) to copy it into the process address space at the address specified by the param parameter.

The sched_setparam( ) system call is similar to sched_setscheduler( ). The difference is that sched_setparam( ) does not let the caller set the policy field's value.[9] The corresponding sys_sched_setparam( ) service routine is almost identical to sys_sched_setscheduler( ), but the policy of the affected process is never changed.

[9] This anomaly is caused by a specific requirement of the POSIX standard.
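From User Mode, the query calls described above are reached through the standard wrappers declared in <sched.h>. The following sketch shows one way a program might inspect its own policy and static priority; the helper names policy_name( ) and print_own_scheduling( ) are ours, introduced only for illustration.

```c
#include <sched.h>
#include <stdio.h>

/* Map a scheduling policy constant to a printable name. */
const char *policy_name(int policy)
{
    switch (policy) {
    case SCHED_FIFO:
        return "SCHED_FIFO";
    case SCHED_RR:
        return "SCHED_RR";
    case SCHED_OTHER:
        return "SCHED_OTHER";
    default:
        return "unknown";
    }
}

/* Query the calling process (pid 0) and print its policy and static
 * priority. Returns 0 on success, -1 if either call fails. */
int print_own_scheduling(void)
{
    struct sched_param param;
    int policy = sched_getscheduler(0);

    if (policy < 0 || sched_getparam(0, &param) < 0)
        return -1;

    printf("policy=%s sched_priority=%d\n",
           policy_name(policy), param.sched_priority);
    return 0;
}
```

Both calls only read scheduling attributes. Changing them through sched_setscheduler( ) or sched_setparam( ) is subject to the capability check described earlier.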

11.3.3.3 The sched_yield( ) system call

The sched_yield( ) system call allows a process to relinquish the CPU voluntarily without being suspended; the process remains in a TASK_RUNNING state, but the scheduler puts it at the end of the runqueue list. In this way, other processes that have the same dynamic priority have a chance to run. The call is used mainly by SCHED_FIFO processes.

The corresponding sys_sched_yield( ) service routine checks first if there is some process in the system that is runnable, other than the process executing the system call and the swapper kernel threads. If there is no such process, sched_yield( ) returns without performing any action because no process would be able to use the freed processor. Otherwise, the function executes the following statements:

    if (current->policy == SCHED_OTHER)
        current->policy |= SCHED_YIELD;
    current->need_resched = 1;
    spin_lock_irq(&runqueue_lock);
    move_last_runqueue(current);
    spin_unlock_irq(&runqueue_lock);

As a result, schedule( ) is invoked when returning from the sys_sched_yield( ) service routine (see Section 4.8), and the current process will most likely be replaced.

11.3.3.4 The sched_get_priority_min( ) and sched_get_priority_max( ) system calls

The sched_get_priority_min( ) and sched_get_priority_max( ) system calls return, respectively, the minimum and the maximum real-time static priority value that can be used with the scheduling policy identified by the policy parameter.

The sys_sched_get_priority_min( ) service routine returns 1 if policy denotes a real-time policy (SCHED_FIFO or SCHED_RR), 0 otherwise.

The sys_sched_get_priority_max( ) service routine returns 99 (the highest priority) if policy denotes a real-time policy, 0 otherwise.

11.3.3.5 The sched_rr_get_interval( ) system call

The sched_rr_get_interval( ) system call writes in a structure stored in the User Mode address space the Round Robin time quantum for the real-time process identified by the pid parameter. If pid is zero, the system call writes the time quantum of the current process.

The corresponding sys_sched_rr_get_interval( ) service routine invokes, as usual, find_process_by_pid( ) to retrieve the process descriptor associated with pid. It then converts the number of ticks corresponding to the nice field of the selected process descriptor into seconds and nanoseconds and copies the numbers into the User Mode structure.
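The conversion from ticks to seconds and nanoseconds is simple arithmetic. The sketch below assumes a tick frequency of 100 ticks per second; the MODEL_HZ constant and the ticks_to_timespec( ) name are hypothetical and are not taken from the kernel sources.

```c
#include <time.h>

#define MODEL_HZ 100UL  /* assumed tick frequency: 100 ticks per second */

/* Convert a number of ticks into whole seconds plus nanoseconds. */
void ticks_to_timespec(unsigned long ticks, struct timespec *ts)
{
    /* Whole seconds contained in the tick count. */
    ts->tv_sec = (time_t)(ticks / MODEL_HZ);

    /* Remaining ticks, each worth 1,000,000,000 / MODEL_HZ nanoseconds. */
    ts->tv_nsec = (long)((ticks % MODEL_HZ) * (1000000000UL / MODEL_HZ));
}
```

With this assumed frequency, a quantum of 6 ticks corresponds to 0 seconds and 60,000,000 nanoseconds.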


Chapter 12. The Virtual Filesystem

One of Linux's keys to success is its ability to coexist comfortably with other systems. You can transparently mount disks or partitions that host file formats used by Windows, other Unix systems, or even systems with tiny market shares like the Amiga. Linux manages to support multiple disk types in the same way other Unix variants do, through a concept called the Virtual Filesystem.

The idea behind the Virtual Filesystem is to put a wide range of information in the kernel to represent many different types of filesystems; there is a field or function to support each operation provided by any real filesystem supported by Linux. For each read, write, or other function called, the kernel substitutes the actual function that supports a native Linux filesystem, the NT filesystem, or whatever other filesystem the file is on.

This chapter discusses the aims, structure, and implementation of Linux's Virtual Filesystem. It focuses on three of the five standard Unix file types, namely regular files, directories, and symbolic links. Device files are covered in Chapter 13, while pipes are discussed in Chapter 19. To show how a real filesystem works, Chapter 17 covers the Second Extended Filesystem that appears on nearly all Linux systems.


12.1 The Role of the Virtual Filesystem (VFS)

The Virtual Filesystem (also known as Virtual Filesystem Switch or VFS) is a kernel software layer that handles all system calls related to a standard Unix filesystem. Its main strength is providing a common interface to several kinds of filesystems.

For instance, let's assume that a user issues the shell command:

    $ cp /floppy/TEST /tmp/test

where /floppy is the mount point of an MS-DOS diskette and /tmp is a normal Second Extended Filesystem (Ext2) directory. As shown in Figure 12-1(a), the VFS is an abstraction layer between the application program and the filesystem implementations. Therefore, the cp program is not required to know the filesystem types of /floppy/TEST and /tmp/test. Instead, cp interacts with the VFS by means of generic system calls known to anyone who has done Unix programming (see Section 1.5.6); the code executed by cp is shown in Figure 12-1(b).

Figure 12-1. VFS role in a simple file copy operation

Filesystems supported by the VFS may be grouped into three main classes:

Disk-based filesystems
    These manage the memory space available in a local disk partition. Some of the well-known disk-based filesystems supported by the VFS are:

    ❍ Filesystems for Linux such as the widely used Second Extended Filesystem (Ext2), the recent Third Extended Filesystem (Ext3), and the Reiser Filesystems (ReiserFS)[1]
    ❍ Filesystems for Unix variants such as SYSV filesystem (System V, Coherent, XENIX), UFS (BSD, Solaris, Next), MINIX filesystem, and VERITAS VxFS (SCO UnixWare)
    ❍ Microsoft filesystems such as MS-DOS, VFAT (Windows 95 and later releases), and NTFS (Windows NT)
    ❍ ISO9660 CD-ROM filesystem (formerly High Sierra Filesystem) and Universal Disk Format (UDF) DVD filesystem
    ❍ Other proprietary filesystems such as IBM's OS/2 (HPFS), Apple's Macintosh (HFS), Amiga's Fast Filesystem (AFFS), and Acorn Disk Filing System (ADFS)
    ❍ Additional journaling filesystems originating in systems other than Linux such as IBM's JFS and SGI's XFS

Network filesystems
    These allow easy access to files included in filesystems belonging to other networked computers. Some well-known network filesystems supported by the VFS are NFS, Coda, AFS (Andrew filesystem), SMB (Server Message Block, used in Microsoft Windows and IBM OS/2 LAN Manager to share files and printers), and NCP (Novell's NetWare Core Protocol).

Special filesystems (also called virtual filesystems)
    These do not manage disk space, either locally or remotely. The /proc filesystem is a typical example of a special filesystem (see the later section Section 12.3.1).

[1] Although these filesystems owe their birth to Linux, they have been ported to several other operating systems.

In this book, we describe in detail the Ext2 and Ext3 filesystems only (see Chapter 17); the other filesystems are not covered for lack of space.

As mentioned in Section 1.5, Unix directories build a tree whose root is the / directory. The root directory is contained in the root filesystem, which in Linux, is usually of type Ext2. All other filesystems can be "mounted" on subdirectories of the root filesystem.[2]

[2] When a filesystem is mounted on a directory, the contents of the directory in the parent filesystem are no longer accessible, since any pathname, including the mount point, will refer to the mounted filesystem. However, the original directory's content shows up again when the filesystem is unmounted. This somewhat surprising feature of Unix filesystems is used by system administrators to hide files; they simply mount a filesystem on the directory containing the files to be hidden.

A disk-based filesystem is usually stored in a hardware block device like a hard disk, a floppy, or a CD-ROM. A useful feature of Linux's VFS allows it to handle virtual block devices like /dev/loop0, which may be used to mount filesystems stored in regular files. As a possible application, a user may protect his own private filesystem by storing an encrypted version of it in a regular file.

The first Virtual Filesystem was included in Sun Microsystems's SunOS in 1986. Since then, most Unix filesystems include a VFS. Linux's VFS, however, supports the widest range of filesystems.

12.1.1 The Common File Model

The key idea behind the VFS consists of introducing a common file model capable of representing all supported filesystems. This model strictly mirrors the file model provided by the traditional Unix filesystem. This is not surprising, since Linux wants to run its native filesystem with minimum overhead. However, each specific filesystem implementation must translate its physical organization into the VFS's common file model.

For instance, in the common file model, each directory is regarded as a file, which contains a list of files and other directories. However, several non-Unix disk-based filesystems use a File Allocation Table (FAT), which stores the position of each file in the directory tree. In these filesystems, directories are not files. To stick to the VFS's common file model, the Linux implementations of such FAT-based filesystems must be able to construct on the fly, when needed, the files corresponding to the directories. Such files exist only as objects in kernel memory.

More essentially, the Linux kernel cannot hardcode a particular function to handle an operation such as read( ) or ioctl( ). Instead, it must use a pointer for each operation; the pointer is made to point to the proper function for the particular filesystem being accessed.

Let's illustrate this concept by showing how the read( ) shown in Figure 12-1 would be translated by the kernel into a call specific to the MS-DOS filesystem. The application's call to read( ) makes the kernel invoke the corresponding sys_read( ) service routine, just like any other system call. The file is represented by a file data structure in kernel memory, as we shall see later in this chapter. This data structure contains a field called f_op that contains pointers to functions specific to MS-DOS files, including a function that reads a file. sys_read( ) finds the pointer to this function and invokes it. Thus, the application's read( ) is turned into the rather indirect call:

    file->f_op->read(...);

Similarly, the write( ) operation triggers the execution of a proper Ext2 write function associated with the output file. In short, the kernel is responsible for assigning the right set of pointers to the file variable associated with each open file, and then for invoking the call specific to each filesystem that the f_op field points to.
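The indirect call through f_op can be reproduced in a self-contained user-space model. The sketch below is not kernel code: the structures are reduced to a single read method, and msdos_file_read( ), ext2_file_read( ), and model_sys_read( ) are hypothetical stand-ins that merely identify which "filesystem" was reached.

```c
#include <stddef.h>
#include <string.h>

struct file;  /* forward declaration for the method signature */

/* Table of methods; only read is modeled in this sketch. */
struct file_operations {
    long (*read)(struct file *file, char *buf, size_t count);
};

/* Reduced file object: just the pointer to the method table. */
struct file {
    const struct file_operations *f_op;
};

/* Stand-in for an MS-DOS specific read function. */
static long msdos_file_read(struct file *file, char *buf, size_t count)
{
    (void)file;
    if (count < 6)
        return -1;
    strcpy(buf, "msdos");
    return 5;
}

/* Stand-in for an Ext2 specific read function. */
static long ext2_file_read(struct file *file, char *buf, size_t count)
{
    (void)file;
    if (count < 5)
        return -1;
    strcpy(buf, "ext2");
    return 4;
}

static const struct file_operations msdos_file_ops = { .read = msdos_file_read };
static const struct file_operations ext2_file_ops = { .read = ext2_file_read };

/* Generic entry point: it never names a specific filesystem and simply
 * follows the pointer stored in the file object. */
long model_sys_read(struct file *file, char *buf, size_t count)
{
    if (file == NULL || file->f_op == NULL || file->f_op->read == NULL)
        return -1;
    return file->f_op->read(file, buf, count);
}
```

The caller of model_sys_read( ) cannot tell which function will run; that decision was made earlier, when the f_op pointer of the file object was assigned.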

One can think of the common file model as object-oriented, where an object is a software construct that defines both a data structure and the methods that operate on it. For reasons of efficiency, Linux is not coded in an object-oriented language like C++. Objects are therefore implemented as data structures with some fields pointing to functions that correspond to the object's methods.

The common file model consists of the following object types:

The superblock object
    Stores information concerning a mounted filesystem. For disk-based filesystems, this object usually corresponds to a filesystem control block stored on disk.

The inode object
    Stores general information about a specific file. For disk-based filesystems, this object usually corresponds to a file control block stored on disk. Each inode object is associated with an inode number, which uniquely identifies the file within the filesystem.

The file object
    Stores information about the interaction between an open file and a process. This information exists only in kernel memory during the period when each process accesses a file.

The dentry object
    Stores information about the linking of a directory entry with the corresponding file. Each disk-based filesystem stores this information in its own particular way on disk.

Figure 12-2 illustrates with a simple example how processes interact with files. Three different processes have opened the same file, two of them using the same hard link. In this case, each of the three processes uses its own file object, while only two dentry objects are required, one for each hard link. Both dentry objects refer to the same inode object, which identifies the superblock object and, together with the latter, the common disk file.

Figure 12-2. Interaction between processes and VFS objects

Besides providing a common interface to all filesystem implementations, the VFS has another important role related to system performance. The most recently used dentry objects are contained in a disk cache named the dentry cache, which speeds up the translation from a file pathname to the inode of the last pathname component.

Generally speaking, a disk cache is a software mechanism that allows the kernel to keep in RAM some information that is normally stored on a disk, so that further accesses to that data can be quickly satisfied without a slow access to the disk itself.[3] Besides the dentry cache, Linux uses other disk caches, like the buffer cache and the page cache, which are described in forthcoming chapters.

[3] Notice how a disk cache differs from a hardware cache or a memory cache, neither of which has anything to do with disks or other devices. A hardware cache is a fast static RAM that speeds up requests directed to the slower dynamic RAM (see Section 2.4.7). A memory cache is a software mechanism introduced to bypass the Kernel Memory Allocator (see Section 7.2.1).

12.1.2 System Calls Handled by the VFS

Table 12-1 illustrates the VFS system calls that refer to filesystems, regular files, directories, and symbolic links. A few other system calls handled by the VFS, such as ioperm( ), ioctl( ), pipe( ), and mknod( ), refer to device files and pipes. These are discussed in later chapters. A last group of system calls handled by the VFS, such as socket( ), connect( ), bind( ), and protocols( ), refer to sockets and are used to implement networking; some of them are discussed in Chapter 18. Some of the kernel service routines that correspond to the system calls listed in Table 12-1 are discussed either in this chapter or in Chapter 17.

Table 12-1. Some system calls handled by the VFS

System call name                                              Description
mount( ) umount( )                                            Mount/unmount filesystems
sysfs( )                                                      Get filesystem information
statfs( ) fstatfs( ) ustat( )                                 Get filesystem statistics
chroot( ) pivot_root( )                                       Change root directory
chdir( ) fchdir( ) getcwd( )                                  Manipulate current directory
mkdir( ) rmdir( )                                             Create and destroy directories
getdents( ) readdir( ) link( ) unlink( ) rename( )            Manipulate directory entries
readlink( ) symlink( )                                        Manipulate soft links
chown( ) fchown( ) lchown( )                                  Modify file owner
chmod( ) fchmod( ) utime( )                                   Modify file attributes
stat( ) fstat( ) lstat( ) access( )                           Read file status
open( ) close( ) creat( ) umask( )                            Open and close files
dup( ) dup2( ) fcntl( )                                       Manipulate file descriptors
select( ) poll( )                                             Asynchronous I/O notification
truncate( ) ftruncate( )                                      Change file size
lseek( ) _llseek( )                                           Change file pointer
read( ) write( ) readv( ) writev( ) sendfile( ) readahead( )  Carry out file I/O operations
pread( ) pwrite( )                                            Seek file and access it
mmap( ) munmap( ) madvise( ) mincore( )                       Handle file memory mapping
fdatasync( ) fsync( ) sync( ) msync( )                        Synchronize file data
flock( )                                                      Manipulate file lock

We said earlier that the VFS is a layer between application programs and specific filesystems. However, in some cases, a file operation can be performed by the VFS itself, without invoking a lower-level procedure. For instance, when a process closes an open file, the file on disk doesn't usually need to be touched, and hence the VFS simply releases the corresponding file object. Similarly, when the lseek( ) system call modifies a file pointer, which is an attribute related to the interaction between an opened file and a process, the VFS needs to modify only the corresponding file object without accessing the file on disk and therefore does not have to invoke a specific filesystem procedure. In some sense, the VFS could be considered a "generic" filesystem that relies, when necessary, on specific ones.
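The lseek( ) case can be made concrete with a small model in which the file pointer lives entirely in an in-memory file object. Everything in the sketch below is an assumption made for illustration: the structure layouts, the f_pos and i_size field names as used here, and the MODEL_SEEK_* constants are simplified stand-ins, not the kernel's definitions.

```c
#include <errno.h>

#define MODEL_SEEK_SET 0  /* offset is absolute */
#define MODEL_SEEK_CUR 1  /* offset is relative to the current position */
#define MODEL_SEEK_END 2  /* offset is relative to the end of the file */

/* In-memory copy of the file size; no disk access is modeled. */
struct model_inode {
    long long i_size;
};

/* Reduced file object: the file pointer belongs to the open file,
 * not to the file stored on disk. */
struct model_file {
    long long f_pos;
    struct model_inode *inode;
};

/* Update the file pointer using only data already held in memory.
 * Returns the new position, or -EINVAL for a bad request. */
long long model_lseek(struct model_file *file, long long offset, int whence)
{
    long long pos;

    switch (whence) {
    case MODEL_SEEK_SET:
        pos = offset;
        break;
    case MODEL_SEEK_CUR:
        pos = file->f_pos + offset;
        break;
    case MODEL_SEEK_END:
        pos = file->inode->i_size + offset;
        break;
    default:
        return -EINVAL;
    }

    if (pos < 0)
        return -EINVAL;

    file->f_pos = pos;
    return pos;
}
```

No function pointer is followed and no filesystem-specific code runs: the whole operation consists of arithmetic on fields of objects that already reside in memory.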


12.2 VFS Data Structures

Each VFS object is stored in a suitable data structure, which includes both the object attributes and a pointer to a table of object methods. The kernel may dynamically modify the methods of the object and, hence, it may install specialized behavior for the object. The following sections explain the VFS objects and their interrelationships in detail.

12.2.1 Superblock Objects

A superblock object consists of a super_block structure whose fields are described in Table 12-2.

Table 12-2. The fields of the superblock object

Type                        Field             Description
struct list_head            s_list            Pointers for superblock list
kdev_t                      s_dev             Device identifier
unsigned long               s_blocksize       Block size in bytes
unsigned char               s_blocksize_bits  Block size in number of bits
unsigned char               s_dirt            Modified (dirty) flag
unsigned long long          s_maxbytes        Maximum size of the files
struct file_system_type *   s_type            Filesystem type
struct super_operations *   s_op              Superblock methods
struct dquot_operations *   dq_op             Disk quota methods
unsigned long               s_flags           Mount flags
unsigned long               s_magic           Filesystem magic number
struct dentry *             s_root            Dentry object of mount directory
struct rw_semaphore         s_umount          Semaphore used for unmounting
struct semaphore            s_lock            Superblock semaphore
int                         s_count           Reference counter
atomic_t                    s_active          Secondary reference counter
struct list_head            s_dirty           List of modified inodes
struct list_head            s_locked_inodes   List of inodes involved in I/O
struct list_head            s_files           List of file objects assigned to the superblock
struct block_device *       s_bdev            Pointer to the block device driver descriptor
struct list_head            s_instances       Pointers for a list of superblock objects of a given filesystem type (see Section 12.3.2)
struct quota_mount_options  s_dquot           Options for disk quota
union                       u                 Specific filesystem information

All superblock objects are linked in a circular doubly linked list. The first element of this list is represented by the super_blocks variable, while the s_list field of the superblock object stores the pointers to the adjacent elements in the list. The sb_lock spin lock protects the list against concurrent accesses in multiprocessor systems.

The last u union field includes superblock information that belongs to a specific filesystem; for instance, as we shall see later in Chapter 17, if the superblock object refers to an Ext2 filesystem, the field stores an ext2_sb_info structure, which includes the disk allocation bit masks and other data of no concern to the VFS common file model.

In general, data in the u field is duplicated in memory for reasons of efficiency. Any disk-based filesystem needs to access and update its allocation bitmaps in order to allocate or release disk blocks. The VFS allows these filesystems to act directly on the u union field of the superblock in memory without accessing the disk. This approach leads to a new problem, however: the VFS superblock might end up no longer synchronized with the corresponding superblock on disk. It is thus necessary to introduce an s_dirt flag, which specifies whether the superblock is dirty (that is, whether the data on the disk must be updated). The lack of synchronization leads to the familiar problem of a corrupted filesystem when a site's power goes down without giving the user the chance to shut down a system cleanly. As we shall see in Section 14.2.4, Linux minimizes this problem by periodically copying all dirty superblocks to disk.

The methods associated with a superblock are called superblock operations. They are described by the super_operations structure whose address is included in the s_op field.

Each specific filesystem can define its own superblock operations. When the VFS needs to invoke one of them, say read_inode( ), it executes the following:

    sb->s_op->read_inode(inode);

where sb stores the address of the superblock object involved. The read_inode field of the super_operations table contains the address of the suitable function, which is therefore directly invoked.

Let's briefly describe the superblock operations, which implement higher-level operations like deleting files or mounting disks. They are listed in the order they appear in the super_operations table:

read_inode(inode)
    Fills the fields of the inode object whose address is passed as the parameter from the data on disk; the i_ino field of the inode object identifies the specific filesystem inode on the disk to be read.

read_inode2(inode, p)
    Similar to the previous one, but the inode is identified by a 64-bit number pointed to by p. This method should disappear as soon as the whole VFS architecture moves to 64-bit quantities; for now, it is used by the ReiserFS filesystem only.

dirty_inode(inode)
    Invoked when the inode is marked as modified (dirty). Used by filesystems like ReiserFS and Ext3 to update the filesystem journal on disk.

write_inode(inode, flag)
    Updates a filesystem inode with the contents of the inode object passed as the parameter; the i_ino field of the inode object identifies the filesystem inode on disk that is concerned. The flag parameter indicates whether the I/O operation should be synchronous.

put_inode(inode)
    Releases the inode object whose address is passed as the parameter. As usual, releasing an object does not necessarily mean freeing memory, since other processes may still use that object.

delete_inode(inode)
    Deletes the data blocks containing the file, the disk inode, and the VFS inode.

put_super(super)
    Releases the superblock object whose address is passed as the parameter (because the corresponding filesystem is unmounted).

write_super(super)
    Updates a filesystem superblock with the contents of the object indicated.

write_super_lockfs(super)
    Blocks changes to the filesystem and updates the superblock with the contents of the object indicated. The method should be implemented by journaling filesystems, and should be invoked by the Logical Volume Manager (LVM) driver. It is currently not in use.

unlockfs(super)
    Undoes the block of filesystem updates achieved by the write_super_lockfs( ) superblock method.

statfs(super, buf)
    Returns statistics on a filesystem by filling the buf buffer.

remount_fs(super, flags, data)
    Remounts the filesystem with new options (invoked when a mount option must be changed).

clear_inode(inode)
    Like put_inode, but also releases all pages that contain data concerning the file that corresponds to the indicated inode.

umount_begin(super)
    Interrupts a mount operation because the corresponding unmount operation has been started (used only by network filesystems).

fh_to_dentry(super, filehandle, len, filehandletype, parent)
    Used by the Network File System (NFS) kernel thread knfsd to return the dentry object corresponding to a given file handle. (A file handle is an identifier of an NFS file.)

dentry_to_fh(dentry, filehandle, lenp, need_parent)
    Used by the NFS kernel thread knfsd to derive the file handle corresponding to a given dentry object.

show_options(seq_file, vfsmount)
    Used to display the filesystem-specific options.

The preceding methods are available to all possible filesystem types. However, only a subset of them applies to each specific filesystem; the fields corresponding to unimplemented methods are set to NULL. Notice that no read_super method to read a superblock is defined: how could the kernel invoke a method of an object yet to be read from disk? We'll find the read_super method in another object describing the filesystem type (see Section 12.4 later in this chapter).
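The dispatch pattern described above, a table of function pointers reached through s_op with NULL marking unimplemented methods, can be sketched in user space. This is only an illustrative model, not kernel code: the structures are cut down to a few fields, and toyfs_read_inode, toyfs_sops, vfs_read_inode, and vfs_write_super are hypothetical names introduced for the example.

```c
#include <stddef.h>

struct inode;
struct super_block;

/* Simplified method table: only two of the methods listed in the text. */
struct super_operations {
    void (*read_inode)(struct inode *);
    void (*write_super)(struct super_block *);
};

struct super_block {
    unsigned char s_dirt;                  /* nonzero if the superblock is dirty */
    const struct super_operations *s_op;   /* filesystem-specific methods */
};

struct inode {
    unsigned long i_ino;
    unsigned long i_blocks;
    struct super_block *i_sb;
};

/* Hypothetical filesystem method: stands in for reading the disk inode. */
static void toyfs_read_inode(struct inode *inode)
{
    inode->i_blocks = inode->i_ino * 2;
}

/* This toy filesystem implements read_inode only; write_super stays NULL. */
const struct super_operations toyfs_sops = {
    .read_inode = toyfs_read_inode,
};

/* Returns 0 if the method was invoked, -1 if the filesystem lacks it. */
int vfs_read_inode(struct super_block *sb, struct inode *inode)
{
    if (sb->s_op == NULL || sb->s_op->read_inode == NULL)
        return -1;
    inode->i_sb = sb;
    sb->s_op->read_inode(inode);           /* the call shown in the text */
    return 0;
}

/* Same pattern for write_super; clears s_dirt only if the method ran. */
int vfs_write_super(struct super_block *sb)
{
    if (sb->s_op == NULL || sb->s_op->write_super == NULL)
        return -1;
    sb->s_op->write_super(sb);
    sb->s_dirt = 0;
    return 0;
}
```

A caller that must cope with any filesystem type checks the method pointer before each call, which is why a missing method can simply be left as NULL in the table.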

12.2.2 Inode Objects

All information needed by the filesystem to handle a file is included in a data structure called an inode. A filename is a casually assigned label that can be changed, but the inode is unique to the file and remains the same as long as the file exists. An inode object in memory consists of an inode structure whose fields are described in Table 12-3.

Table 12-3. The fields of the inode object

Type                        Field                 Description
struct list_head            i_hash                Pointers for the hash list
struct list_head            i_list                Pointers for the inode list
struct list_head            i_dentry              Pointers for the dentry list
struct list_head            i_dirty_buffers       Pointers for the modified buffers list
struct list_head            i_dirty_data_buffers  Pointers for the modified data buffers list
unsigned long               i_ino                 Inode number
unsigned int                i_count               Usage counter
kdev_t                      i_dev                 Device identifier
umode_t                     i_mode                File type and access rights
nlink_t                     i_nlink               Number of hard links
uid_t                       i_uid                 Owner identifier
gid_t                       i_gid                 Group identifier
kdev_t                      i_rdev                Real device identifier
off_t                       i_size                File length in bytes
time_t                      i_atime               Time of last file access
time_t                      i_mtime               Time of last file write
time_t                      i_ctime               Time of last inode change
unsigned int                i_blkbits             Block size in number of bits
unsigned long               i_blksize             Block size in bytes
unsigned long               i_blocks              Number of blocks of the file
unsigned long               i_version             Version number, automatically incremented after each use
struct semaphore            i_sem                 Inode semaphore
struct semaphore            i_zombie              Secondary inode semaphore used when removing or renaming the inode
struct inode_operations *   i_op                  Inode operations
struct file_operations *    i_fop                 Default file operations
struct super_block *        i_sb                  Pointer to superblock object
wait_queue_head_t           i_wait                Inode wait queue
struct file_lock *          i_flock               Pointer to file lock list
struct address_space *      i_mapping             Pointer to an address_space object (see Chapter 14)
struct address_space        i_data                address_space object for block device file
struct dquot **             i_dquot               Inode disk quotas
struct list_head            i_devices             Pointers of a list of block device file inodes (see Chapter 13)
struct pipe_inode_info *    i_pipe                Used if the file is a pipe (see Chapter 19)
struct block_device *       i_bdev                Pointer to the block device driver
struct char_device *        i_cdev                Pointer to the character device driver
unsigned long               i_dnotify_mask        Bit mask of directory notify events
struct dnotify_struct *     i_dnotify             Used for directory notifications
unsigned long               i_state               Inode state flags
unsigned int                i_flags               Filesystem mount flags
unsigned char               i_sock                Nonzero if file is a socket
atomic_t                    i_writecount          Usage counter for writing processes
unsigned int                i_attr_flags          File creation flags
__u32                       i_generation          Inode version number (used by some filesystems)
union                       u                     Specific filesystem information

The final u union field is used to include inode information that belongs to a specific filesystem. For instance, as we shall see in Chapter 17, if the inode object refers to an Ext2 file, the field stores an ext2_inode_info structure.

Each inode object duplicates some of the data included in the disk inode, for instance, the number of blocks allocated to the file. When the i_state field has the I_DIRTY_SYNC, I_DIRTY_DATASYNC, or I_DIRTY_PAGES flag set, the inode is dirty, that is, the corresponding disk inode must be updated; the I_DIRTY macro can be used to check the value of these three flags at once (see later for details). Other values of the i_state field are I_LOCK (the inode object is involved in an I/O transfer), I_FREEING (the inode object is being freed), and I_CLEAR (the inode object contents are no longer meaningful).

Each inode object always appears in one of the following circular doubly linked lists:

● The list of valid unused inodes, typically those mirroring valid disk inodes and not currently used by any process. These inodes are not dirty and their i_count field is set to 0. The first and last elements of this list are referenced by the next and prev fields, respectively, of the inode_unused variable. This list acts as a disk cache.

● The list of in-use inodes, typically those mirroring valid disk inodes and used by some process. These inodes are not dirty and their i_count field is positive. The first and last elements are referenced by the inode_in_use variable.

● The list of dirty inodes. The first and last elements are referenced by the s_dirty field of the corresponding superblock object.

Each of the lists just mentioned links the i_list fields of the proper inode objects.
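The relationship between the i_state flags, the i_count field, and the three lists can be summarized in a few lines of C. This is a sketch under stated assumptions: the numeric flag values are chosen only for illustration, the inode structure is reduced to two fields, and inode_list_for and the enum inode_list constants are hypothetical names.

```c
/* Flag values are assumptions for illustration; only the names come from the text. */
#define I_DIRTY_SYNC      1
#define I_DIRTY_DATASYNC  2
#define I_DIRTY_PAGES     4
#define I_LOCK            8
#define I_FREEING        16
#define I_CLEAR          32

/* Tests the three dirty flags at once, as the text describes for I_DIRTY. */
#define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)

enum inode_list { LIST_UNUSED, LIST_IN_USE, LIST_DIRTY };

struct inode {
    unsigned long i_state;   /* inode state flags */
    unsigned int  i_count;   /* usage counter */
};

/* Decide which of the three lists a valid inode belongs to. */
enum inode_list inode_list_for(const struct inode *inode)
{
    if (inode->i_state & I_DIRTY)
        return LIST_DIRTY;                 /* superblock's s_dirty list */
    return inode->i_count > 0 ? LIST_IN_USE : LIST_UNUSED;
}
```

Because the dirty flags are independent bits, an inode can be both locked and dirty; the mask test still places it on the dirty list.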

Inode objects are also included in a hash table named inode_hashtable. The hash table speeds up the search of the inode object when the kernel knows both the inode number and the address of the superblock object corresponding to the filesystem that includes the file. [4] Since hashing may induce collisions, the inode object includes an i_hash field that contains a backward and a forward pointer to other inodes that hash to the same position; this field creates a doubly linked list of those inodes. The hash table also includes a special chain list for the inodes not assigned to a superblock (such as the inodes used by sockets; see Chapter 18); its first and last elements are referenced by the anon_hash_chain variable.

[4] Actually, a Unix process may open a file and then unlink it. The i_nlink field of the inode could become 0, yet the process is still able to act on the file. In this particular case, the inode is removed from the hash table, even if it still belongs to the in-use or dirty list.
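A user-space model may help make the lookup by superblock address and inode number concrete. The sketch below assumes a small fixed-size table and an arbitrary hash function; neither is the kernel's. The hash_next and hash_prev pointers stand in for the i_hash field, and toy_hash, toy_insert_inode, toy_remove_inode, and toy_find_inode are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

#define HASH_BITS 4
#define HASH_SIZE (1u << HASH_BITS)

struct super_block { int s_id; };

struct inode {
    unsigned long i_ino;
    struct super_block *i_sb;
    struct inode *hash_next, *hash_prev;   /* stand in for the i_hash field */
};

static struct inode *inode_hashtable[HASH_SIZE];

/* Illustrative hash mixing the superblock address and the inode number. */
static unsigned int toy_hash(const struct super_block *sb, unsigned long ino)
{
    uintptr_t v = (uintptr_t)sb / sizeof(void *) + ino;
    return (unsigned int)(v ^ (v >> HASH_BITS)) & (HASH_SIZE - 1);
}

/* Put the inode at the head of its collision chain. */
void toy_insert_inode(struct inode *inode)
{
    struct inode **head = &inode_hashtable[toy_hash(inode->i_sb, inode->i_ino)];

    inode->hash_prev = NULL;
    inode->hash_next = *head;
    if (*head != NULL)
        (*head)->hash_prev = inode;
    *head = inode;
}

/* Unlink an inode that is currently hashed (as happens after unlink). */
void toy_remove_inode(struct inode *inode)
{
    if (inode->hash_prev != NULL)
        inode->hash_prev->hash_next = inode->hash_next;
    else
        inode_hashtable[toy_hash(inode->i_sb, inode->i_ino)] = inode->hash_next;
    if (inode->hash_next != NULL)
        inode->hash_next->hash_prev = inode->hash_prev;
    inode->hash_next = inode->hash_prev = NULL;
}

/* Walk one chain; both the superblock and the inode number must match. */
struct inode *toy_find_inode(struct super_block *sb, unsigned long ino)
{
    struct inode *p;

    for (p = inode_hashtable[toy_hash(sb, ino)]; p != NULL; p = p->hash_next)
        if (p->i_sb == sb && p->i_ino == ino)
            return p;
    return NULL;
}
```

The comparison on both keys matters because the same inode number can legitimately occur in several mounted filesystems.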

The methods associated with an inode object are also called inode operations. They are described by an inode_operations structure, whose address is included in the i_op field. Here are the inode operations in the order they appear in the inode_operations table:

create(dir, dentry, mode)
    Creates a new disk inode for a regular file associated with a dentry object in some directory.

lookup(dir, dentry)
    Searches a directory for an inode corresponding to the filename included in a dentry object.

link(old_dentry, dir, new_dentry)
    Creates a new hard link that refers to the file specified by old_dentry in the directory dir; the new hard link has the name specified by new_dentry.

unlink(dir, dentry)
    Removes the hard link of the file specified by a dentry object from a directory.

symlink(dir, dentry, symname)
    Creates a new inode for a symbolic link associated with a dentry object in some directory.

mkdir(dir, dentry, mode)
    Creates a new inode for a directory associated with a dentry object in some directory.

rmdir(dir, dentry)
    Removes from a directory the subdirectory whose name is included in a dentry object.

mknod(dir, dentry, mode, rdev)
    Creates a new disk inode for a special file associated with a dentry object in some directory. The mode and rdev parameters specify, respectively, the file type and the device's major number.

rename(old_dir, old_dentry, new_dir, new_dentry)
    Moves the file identified by old_dentry from the old_dir directory to the new_dir one. The new filename is included in the dentry object that new_dentry points to.

readlink(dentry, buffer, buflen)
    Copies into a memory area specified by buffer the file pathname corresponding to the symbolic link specified by the dentry.

follow_link(inode, dir)
    Translates a symbolic link specified by an inode object; if the symbolic link is a relative pathname, the lookup operation starts from the specified directory.

truncate(inode)
    Modifies the size of the file associated with an inode. Before invoking this method, it is necessary to set the i_size field of the inode object to the required new size.

permission(inode, mask)
    Checks whether the specified access mode is allowed for the file associated with inode.

revalidate(dentry)
    Updates the cached attributes of a file specified by a dentry object (usually invoked by the network filesystem).

setattr(dentry, iattr)
    Notifies a "change event" after touching the inode attributes.

getattr(dentry, iattr)
    Used by networking filesystems when noticing that some cached inode attributes must be refreshed.

The methods just listed are available to all possible inodes and filesystem types. However, only a subset of them applies to a specific inode and filesystem; the fields corresponding to unimplemented methods are set to NULL.

12.2.3 File Objects

A file object describes how a process interacts with a file it has opened. The object is created when the file is opened and consists of a file structure, whose fields are described in Table 12-4. Notice that file objects have no corresponding image on disk, and hence no "dirty" field is included in the file structure to specify that the file object has been modified.

Table 12-4. The fields of the file object

Type                       Field          Description
struct list_head           f_list         Pointers for generic file object list
struct dentry *            f_dentry       Dentry object associated with the file
struct vfsmount *          f_vfsmnt       Mounted filesystem containing the file
struct file_operations *   f_op           Pointer to file operation table
atomic_t                   f_count        File object's usage counter
unsigned int               f_flags        Flags specified when opening the file
mode_t                     f_mode         Process access mode
loff_t                     f_pos          Current file offset (file pointer)
unsigned long              f_reada        Read-ahead flag
unsigned long              f_ramax        Maximum number of pages to be read ahead
unsigned long              f_raend        File pointer after last read-ahead
unsigned long              f_ralen        Number of read-ahead bytes
unsigned long              f_rawin        Number of read-ahead pages
struct fown_struct         f_owner        Data for asynchronous I/O via signals
unsigned int               f_uid          User's UID
unsigned int               f_gid          User's GID
int                        f_error        Error code for network write operation
unsigned long              f_version      Version number, automatically incremented after each use
void *                     private_data   Needed for tty driver
struct kiobuf *            f_iobuf        Descriptor for direct access buffer (see Section 15.2)
long                       f_iobuf_lock   Lock for direct I/O transfer

The main information stored in a file object is the file pointer, that is, the current position in the file from which the next operation will take place. Since several processes may access the same file concurrently, the file pointer cannot be kept in the inode object.

Each file object is always included in one of the following circular doubly linked lists:

● The list of "unused" file objects. This list acts both as a memory cache for the file objects and as a reserve for the superuser; it allows the superuser to open a file even if the dynamic memory in the system is exhausted. Since the objects are unused, their f_count fields are 0. The first element of the list is a dummy and it is stored in the free_list variable. The kernel makes sure that the list always contains at least NR_RESERVED_FILES objects, usually 10.

● The list of "in use" file objects not yet assigned to a superblock. The f_count field of each element in this list is set to 1. The first element of the list is a dummy and it is stored in the anon_list variable.

● Several lists of "in use" file objects already assigned to superblocks. Each superblock object stores in the s_files field the dummy first element of a list of file objects; thus, file objects of files belonging to different filesystems are included in different lists. The f_count field of each element in such a list is set to 1 plus the number of processes that are using the file object.

Regardless of which list a file object is in at the moment, the pointers of the next and previous elements in the list are stored in the f_list field of the file object. The files_lock spin lock protects the lists against concurrent accesses in multiprocessor systems.

The size of the list of "unused" file objects is stored in the nr_free_files field of the files_stat variable. The get_empty_filp( ) function is invoked when the VFS must allocate a new file object. The function checks whether the "unused" list has more than NR_RESERVED_FILES items, in which case one can be used for the newly opened file. Otherwise, it falls back to normal memory allocation. The files_stat variable also includes the nr_files field (which stores the number of file objects included in all lists) and the max_files field (which is the maximum number of allocatable file objects, i.e., the maximum number of files that can be accessed at the same time in the system). [5]

[5] By default, max_files stores the value 8,192, but the system administrator can tune this parameter by writing into the /proc/sys/fs/file-max file.

As we explained earlier in Section 12.1.1, each filesystem includes its own set of file operations that perform such activities as reading and writing a file. When the kernel loads an inode into memory from disk, it stores pointers to these file operations in a file_operations structure whose address is contained in the i_fop field of the inode object. When a process opens the file, the VFS initializes the f_op field of the new file object with the address stored in the inode so that further calls to file operations can use these functions. If necessary, the VFS may later modify the set of file operations by storing a new value in f_op.
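The allocation policy described for get_empty_filp( ) can be modeled with the three counters of files_stat alone. The sketch below is a simplified reading of the text, not the kernel function: list handling and locking are omitted, the superuser branch reflects the statement that the unused list acts as a reserve for the superuser, and the max_files check reflects its description as the maximum number of allocatable file objects. The choose_filp_source function and the enum filp_source constants are hypothetical.

```c
#define NR_RESERVED_FILES 10   /* "usually 10" according to the text */

/* The three counters that the text attributes to the files_stat variable. */
struct files_stat_struct {
    int nr_files;        /* file objects in all lists */
    int nr_free_files;   /* file objects in the "unused" list */
    int max_files;       /* upper limit on allocatable file objects */
};

enum filp_source { FILP_NONE, FILP_FROM_FREE_LIST, FILP_NEWLY_ALLOCATED };

/*
 * Decide where a new file object would come from and update the counters.
 * Ordinary users may take an unused object only while more than
 * NR_RESERVED_FILES remain; the superuser may dip into the reserve.
 */
enum filp_source choose_filp_source(struct files_stat_struct *st, int is_superuser)
{
    if (st->nr_free_files > NR_RESERVED_FILES ||
        (is_superuser && st->nr_free_files > 0)) {
        st->nr_free_files--;
        return FILP_FROM_FREE_LIST;
    }
    if (st->nr_files < st->max_files) {    /* fall back to normal allocation */
        st->nr_files++;
        return FILP_NEWLY_ALLOCATED;
    }
    return FILP_NONE;                      /* limit reached */
}
```

Under this model an ordinary process can be refused a file object while the superuser still obtains one from the reserve, which is the point of keeping NR_RESERVED_FILES objects aside.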

The following list describes the file operations in the order in which they appear in the file_operations table:

llseek(file, offset, origin)
    Updates the file pointer.

read(file, buf, count, offset)
    Reads count bytes from a file starting at position *offset; the value *offset (which usually corresponds to the file pointer) is then incremented.

write(file, buf, count, offset)
    Writes count bytes into a file starting at position *offset; the value *offset (which usually corresponds to the file pointer) is then incremented.

readdir(dir, dirent, filldir)
    Returns the next directory entry of a directory in dirent; the filldir parameter contains the address of an auxiliary function that extracts the fields in a directory entry.

poll(file, poll_table)
    Checks whether there is activity on a file and goes to sleep until something happens on it.

ioctl(inode, file, cmd, arg)
    Sends a command to an underlying hardware device. This method applies only to device files.

mmap(file, vma)
    Performs a memory mapping of the file into a process address space (see Chapter 15).

open(inode, file)
    Opens a file by creating a new file object and linking it to the corresponding inode object (see Section 12.6.1 later in this chapter).

flush(file)
    Called when a reference to an open file is closed, that is, when the f_count field of the file object is decremented. The actual purpose of this method is filesystem dependent.

release(inode, file)
    Releases the file object. Called when the last reference to an open file is closed, that is, when the f_count field of the file object becomes 0.

fsync(file, dentry)
    Writes all cached data of the file to disk.

fasync(fd, file, on)
    Enables or disables asynchronous I/O notification by means of signals.

lock(file, cmd, file_lock)
    Applies a lock to the file (see Section 12.7 later in this chapter).

readv(file, vector, count, offset)
    Reads bytes from a file and puts the results in the buffers described by vector; the number of buffers is specified by count.

writev(file, vector, count, offset)
    Writes bytes into a file from the buffers described by vector; the number of buffers is specified by count.

sendpage(file, page, offset, size, pointer, fill)
    Transfers data from this file to another file; this method is used by sockets (see Chapter 18).

get_unmapped_area(file, addr, len, offset, flags)
    Gets an unused address range to map the file (used for frame buffer memory mappings).

The methods just described are available to all possible file types. However, only a subset of them applies to a specific file type; the fields corresponding to unimplemented methods are set to NULL.
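To illustrate how f_op is copied from i_fop at open time and how a read method advances *offset, here is a minimal user-space sketch. Everything in it is an assumption made for the example: the inode holds its data in memory, the file object points directly at the inode (the real one goes through f_dentry), and memfile_read, memfile_fops, vfs_open, vfs_read, and vfs_write are hypothetical names.

```c
#include <stddef.h>
#include <string.h>

struct file;

/* Cut-down method table with just two of the methods listed in the text. */
struct file_operations {
    long (*read)(struct file *file, char *buf, size_t count, long long *offset);
    long (*write)(struct file *file, const char *buf, size_t count, long long *offset);
};

struct inode {
    const char *data;                      /* hypothetical in-memory contents */
    size_t i_size;                         /* file length in bytes */
    const struct file_operations *i_fop;   /* default file operations */
};

struct file {
    struct inode *inode;                   /* simplification of f_dentry->d_inode */
    const struct file_operations *f_op;
    long long f_pos;                       /* file pointer, private to this object */
};

/* Reads up to count bytes starting at *offset, then advances *offset. */
long memfile_read(struct file *file, char *buf, size_t count, long long *offset)
{
    const struct inode *inode = file->inode;
    size_t pos;

    if (*offset < 0 || (size_t)*offset >= inode->i_size)
        return 0;                          /* nothing left to read */
    pos = (size_t)*offset;
    if (count > inode->i_size - pos)
        count = inode->i_size - pos;
    memcpy(buf, inode->data + pos, count);
    *offset += (long long)count;
    return (long)count;
}

/* Read-only file type: the write field is left NULL. */
const struct file_operations memfile_fops = { .read = memfile_read };

/* At open time the file object inherits the inode's default operations. */
void vfs_open(struct file *file, struct inode *inode)
{
    file->inode = inode;
    file->f_op = inode->i_fop;
    file->f_pos = 0;
}

long vfs_read(struct file *file, char *buf, size_t count)
{
    if (file->f_op == NULL || file->f_op->read == NULL)
        return -1;
    return file->f_op->read(file, buf, count, &file->f_pos);
}

long vfs_write(struct file *file, const char *buf, size_t count)
{
    if (file->f_op == NULL || file->f_op->write == NULL)
        return -1;                         /* method not implemented */
    return file->f_op->write(file, buf, count, &file->f_pos);
}
```

Opening the same inode twice yields two file objects whose f_pos values advance independently, which is the reason the file pointer lives in the file object rather than in the inode.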

12.2.4 dentry Objects

We mentioned in Section 12.1.1 that the VFS considers each directory a file that contains a list of files and other directories. We shall discuss in Chapter 17 how directories are implemented on a specific filesystem. Once a directory entry is read into memory, however, it is transformed by the VFS into a dentry object based on the dentry structure, whose fields are described in Table 12-5. The kernel creates a dentry object for every component of a pathname that a process looks up; the dentry object associates the component to its corresponding inode. For example, when looking up the /tmp/test pathname, the kernel creates a dentry object for the / root directory, a second dentry object for the tmp entry of the root directory, and a third dentry object for the test entry of the /tmp directory.

Notice that dentry objects have no corresponding image on disk, and hence no field is included in the dentry structure to specify that the object has been modified. Dentry objects are stored in a slab allocator cache called dentry_cache; dentry objects are thus created and destroyed by invoking kmem_cache_alloc( ) and kmem_cache_free( ).

Table 12-5. The fields of the dentry object

Type                         Field         Description
atomic_t                     d_count       Dentry object usage counter
unsigned int                 d_flags       Dentry flags
struct inode *               d_inode       Inode associated with filename
struct dentry *              d_parent      Dentry object of parent directory
struct list_head             d_hash        Pointers for list in hash table entry
struct list_head             d_lru         Pointers for unused list
struct list_head             d_child       Pointers for the list of dentry objects included in parent directory
struct list_head             d_subdirs     For directories, list of dentry objects of subdirectories
struct list_head             d_alias       List of associated inodes (alias)
int                          d_mounted     Flag set to 1 if and only if the dentry is the mount point for a filesystem
struct qstr                  d_name        Filename
unsigned long                d_time        Used by d_revalidate method
struct dentry_operations *   d_op          Dentry methods
struct super_block *         d_sb          Superblock object of the file
unsigned long                d_vfs_flags   Dentry cache flags
void *                       d_fsdata      Filesystem-dependent data
unsigned char *              d_iname       Space for short filename

Each dentry object may be in one of four states:

Free
    The dentry object contains no valid information and is not used by the VFS. The corresponding memory area is handled by the slab allocator.

Unused
    The dentry object is not currently used by the kernel. The d_count usage counter of the object is 0, but the d_inode field still points to the associated inode. The dentry object contains valid information, but its contents may be discarded if necessary in order to reclaim memory.

In use
    The dentry object is currently used by the kernel. The d_count usage counter is positive and the d_inode field points to the associated inode object. The dentry object contains valid information and cannot be discarded.

Negative
    The inode associated with the dentry does not exist, either because the corresponding disk inode has been deleted or because the dentry object was created by resolving a pathname of a nonexisting file. The d_inode field of the dentry object is set to NULL, but the object still remains in the dentry cache so that further lookup operations to the same file pathname can be quickly resolved. The term "negative" is misleading since no negative value is involved.
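The three states of an allocated dentry can be told apart from just two fields, d_count and d_inode. The following sketch assumes a cut-down dentry structure; classify_dentry and the enum constants are hypothetical names, and the free state is left out because a free dentry is simply memory owned by the slab allocator.

```c
#include <stddef.h>

struct inode { unsigned long i_ino; };

/* Only the two fields that determine the state are modeled here. */
struct dentry {
    int d_count;             /* usage counter (atomic_t in the real structure) */
    struct inode *d_inode;   /* NULL for a negative dentry */
};

enum dentry_state { DENTRY_UNUSED, DENTRY_IN_USE, DENTRY_NEGATIVE };

/* Classify an allocated dentry according to the rules given in the text. */
enum dentry_state classify_dentry(const struct dentry *d)
{
    if (d->d_inode == NULL)
        return DENTRY_NEGATIVE;   /* no inode exists for this name */
    return d->d_count > 0 ? DENTRY_IN_USE : DENTRY_UNUSED;
}
```

Note that "negative" is decided by the NULL d_inode pointer alone, which matches the remark that no negative value is involved.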

12.2.5 The dentry Cache

Since reading a directory entry from disk and constructing the corresponding dentry object requires considerable time, it makes sense to keep in memory dentry objects that you've finished with but might need later. For instance, people often edit a file and then compile it, or edit and print it, or copy it and then edit the copy. In such cases, the same file needs to be repeatedly accessed.

To maximize efficiency in handling dentries, Linux uses a dentry cache, which consists of two kinds of data structures:

● A set of dentry objects in the in-use, unused, or negative state.
● A hash table to derive the dentry object associated with a given filename and a given directory quickly. As usual, if the required object is not included in the dentry cache, the hashing function returns a null value.

The dentry cache also acts as a controller for an inode cache. The inodes in kernel memory that are associated with unused dentries are not discarded, since the dentry cache is still using them. Thus, the inode objects are kept in RAM and can be quickly referenced by means of the corresponding dentries.

All the "unused" dentries are included in a doubly linked "Least Recently Used" list sorted by time of insertion. In other words, the dentry object that was last released is put in front of the list, so the least recently used dentry objects are always near the end of the list. When the dentry cache has to shrink, the kernel removes elements from the tail of this list so that the most recently used objects are preserved. The addresses of the first and last elements of the LRU list are stored in the next and prev fields of the dentry_unused variable. The d_lru field of the dentry object contains pointers to the adjacent dentries in the list.

Each "in use" dentry object is inserted into a doubly linked list specified by the i_dentry field of the corresponding inode object (since each inode could be associated with several hard links, a list is required). The d_alias field of the dentry object stores the addresses of the adjacent elements in the list. Both fields are of type struct list_head.
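The LRU discipline described above (insert at the front, reclaim from the tail) can be sketched with a circular doubly linked list that has a dummy head. This is a user-space illustration under stated assumptions: struct list_head is redefined locally in the same shape as the kernel type, while toy_dentry, lru_entry, and lru_prune_one( ) are hypothetical names that do not appear in the kernel.

```c
#include <assert.h>
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

struct toy_dentry {
    int id;                  /* illustrative payload */
    struct list_head d_lru;  /* links to adjacent unused dentries */
};

static void list_init(struct list_head *h) { h->next = h->prev = h; }

/* Insert item right after head, i.e. at the front of the list. */
static void list_add(struct list_head *item, struct list_head *head)
{
    item->next = head->next;
    item->prev = head;
    head->next->prev = item;
    head->next = item;
}

static void list_del(struct list_head *item)
{
    item->prev->next = item->next;
    item->next->prev = item->prev;
    item->next = item->prev = item;
}

/* Recover the enclosing toy_dentry from its embedded d_lru field. */
#define lru_entry(ptr) \
    ((struct toy_dentry *)((char *)(ptr) - offsetof(struct toy_dentry, d_lru)))

/* Detach and return the least recently released dentry, or NULL if none. */
static struct toy_dentry *lru_prune_one(struct list_head *unused)
{
    struct list_head *tail;

    assert(unused != NULL);
    tail = unused->prev;
    if (tail == unused)
        return NULL;         /* only the dummy head is left */
    list_del(tail);
    return lru_entry(tail);
}
```

Because every release goes to the front, repeatedly calling lru_prune_one( ) hands back dentries in the order in which they were released, oldest first.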

An "in use" dentry object may become "negative" when the last hard link to the corresponding file is deleted. In this case, the dentry object is moved into the LRU list of unused dentries. Each time the kernel shrinks the dentry cache, negative dentries move toward the tail of the LRU list so that they are gradually freed (see Section 16.7.6).

The hash table is implemented by means of a dentry_hashtable array. Each element is a pointer to a list of dentries that hash to the same hash table value. The array's size depends on the amount of RAM installed in the system. The d_hash field of the dentry object contains pointers to the adjacent elements in the list associated with a single hash value. The hash function produces its value from both the address of the dentry object of the directory and the filename.
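To make the last sentence concrete, here is a small sketch of how a bucket index could be derived from the parent directory's dentry address together with the filename. It is only an illustration: the mixing steps, the table size, and the names toy_name_hash( ), toy_bucket( ), and TOY_HASH_BITS are assumptions of this example and are not the kernel's actual hash function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_HASH_BITS 8                      /* assumption: 256 buckets */
#define TOY_HASH_SIZE (1u << TOY_HASH_BITS)

/* Simple multiplicative string hash over the filename component. */
static unsigned long toy_name_hash(const char *name)
{
    unsigned long h = 0;

    assert(name != NULL);
    while (*name)
        h = h * 31 + (unsigned char)*name++;
    return h;
}

/* Combine the name hash with the address of the parent directory's dentry. */
static unsigned int toy_bucket(const void *parent, const char *name)
{
    unsigned long h = toy_name_hash(name);

    /* Drop the low address bits, which are dominated by alignment. */
    h += (unsigned long)((uintptr_t)parent / 64);
    h ^= h >> TOY_HASH_BITS;
    return (unsigned int)(h & (TOY_HASH_SIZE - 1));
}
```

Mixing in the parent address means that two files with the same name in different directories usually land in different buckets, although collisions remain possible and are resolved by walking the bucket's list.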

The dcache_lock spin lock protects the dentry cache data structures against concurrent accesses in multiprocessor systems. The d_lookup( ) function looks in the hash table for a given parent dentry object and filename.

The methods associated with a dentry object are called dentry operations; they are described by the dentry_operations structure, whose address is stored in the d_op field. Although some filesystems define their own dentry methods, the fields are usually NULL and the VFS replaces them with default functions. Here are the methods, in the order they appear in the dentry_operations table:

d_revalidate(dentry, flag)
Determines whether the dentry object is still valid before using it for translating a file pathname. The default VFS function does nothing, although network filesystems may specify their own functions.

d_hash(dentry, name)
Creates a hash value; this function is a filesystem-specific hash function for the dentry hash table. The dentry parameter identifies the directory containing the component. The name parameter points to a structure containing both the pathname component to be looked up and the value produced by the hash function.

d_compare(dir, name1, name2)
Compares two filenames; name1 should belong to the directory referenced by dir. The default VFS function is a normal string match. However, each filesystem can implement this method in its own way. For instance, MS-DOS does not distinguish capital from lowercase letters.

d_delete(dentry)
Called when the last reference to a dentry object is deleted (d_count becomes 0). The default VFS function does nothing.

d_release(dentry)
Called when a dentry object is going to be freed (released to the slab allocator). The default VFS function does nothing.

d_iput(dentry, ino)
Called when a dentry object becomes "negative", that is, when it loses its inode. The default VFS function invokes iput( ) to release the inode object.
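The "NULL field means default behavior" convention can be sketched with a one-method operations table. The example below is a user-space illustration, assuming simplified signatures that take plain C strings; toy_dentry, toy_dentry_operations, nocase_compare( ), and names_match( ) are hypothetical names, and the return convention (0 on match) is an assumption of the sketch.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

struct toy_dentry;

struct toy_dentry_operations {
    /* Assumed convention: returns 0 when the two names match. */
    int (*d_compare)(struct toy_dentry *dir, const char *name1, const char *name2);
};

struct toy_dentry {
    const struct toy_dentry_operations *d_op;  /* NULL: use the defaults */
};

/* Default behavior: a normal, case-sensitive string match. */
static int default_compare(struct toy_dentry *dir, const char *a, const char *b)
{
    (void)dir;
    return strcmp(a, b) != 0;
}

/* Filesystem-specific variant that ignores letter case, MS-DOS style. */
static int nocase_compare(struct toy_dentry *dir, const char *a, const char *b)
{
    (void)dir;
    while (*a && *b) {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 1;
        a++;
        b++;
    }
    return *a != *b;  /* both must end at the same position */
}

static const struct toy_dentry_operations nocase_ops = { nocase_compare };

/* Dispatch to the filesystem's method if defined, else to the default. */
static int names_match(struct toy_dentry *dir, const char *a, const char *b)
{
    assert(dir != NULL);
    if (dir->d_op && dir->d_op->d_compare)
        return dir->d_op->d_compare(dir, a, b) == 0;
    return default_compare(dir, a, b) == 0;
}
```

With this arrangement a filesystem only has to fill in the methods whose behavior differs from the generic one.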

12.2.6 Files Associated with a Process

We mentioned in Section 1.5 that each process has its own current working directory and its own root directory. These are just two examples of data that must be maintained by the kernel to represent the interactions between a process and a filesystem. A whole data structure of type fs_struct is used for that purpose (see Table 12-6) and each process descriptor has an fs field that points to the process fs_struct structure.

Table 12-6. The fields of the fs_struct structure

Type                Field        Description
atomic_t            count        Number of processes sharing this table
rwlock_t            lock         Read/write spin lock for the table fields
int                 umask        Bit mask used when opening the file to set the file permissions
struct dentry *     root         Dentry of the root directory
struct dentry *     pwd          Dentry of the current working directory
struct dentry *     altroot      Dentry of the emulated root directory (always NULL for the 80x86 architecture)
struct vfsmount *   rootmnt      Mounted filesystem object of the root directory
struct vfsmount *   pwdmnt       Mounted filesystem object of the current working directory
struct vfsmount *   altrootmnt   Mounted filesystem object of the emulated root directory (always NULL for the 80x86 architecture)

A second table, whose address is contained in the files field of the process descriptor, specifies which files are currently opened by the process. It is a files_struct structure whose fields are illustrated in Table 12-7.

Table 12-7. The fields of the files_struct structure

Type             Field                Description
atomic_t         count                Number of processes sharing this table
rwlock_t         file_lock            Read/write spin lock for the table fields
int              max_fds              Current maximum number of file objects
int              max_fdset            Current maximum number of file descriptors
int              next_fd              Maximum file descriptors ever allocated plus 1
struct file **   fd                   Pointer to array of file object pointers
fd_set *         close_on_exec        Pointer to file descriptors to be closed on exec( )
fd_set *         open_fds             Pointer to open file descriptors
fd_set           close_on_exec_init   Initial set of file descriptors to be closed on exec( )
fd_set           open_fds_init        Initial set of file descriptors
struct file **   fd_array             Initial array of file object pointers

The fd field points to an array of pointers to file objects. The size of the array is stored in the max_fds field. Usually, fd points to the fd_array field of the files_struct structure, which includes 32 file object pointers. If the process opens more than 32 files, the kernel allocates a new, larger array of file pointers and stores its address in the fd field; it also updates the max_fds field.
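The switch from the embedded array to a larger, separately allocated one can be sketched as follows. This is a user-space illustration that uses calloc( ) in place of kernel allocation and ignores locking; toy_files, toy_expand_fd_array( ), and the doubling growth policy are assumptions of the example, not the kernel's implementation.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TOY_NR_OPEN_DEFAULT 32

struct toy_file { int f_count; };

struct toy_files {
    int max_fds;                                     /* current size of fd[] */
    struct toy_file **fd;                            /* current array */
    struct toy_file *fd_array[TOY_NR_OPEN_DEFAULT];  /* embedded initial array */
};

static void toy_files_init(struct toy_files *f)
{
    assert(f != NULL);
    memset(f->fd_array, 0, sizeof f->fd_array);
    f->fd = f->fd_array;                             /* the usual case */
    f->max_fds = TOY_NR_OPEN_DEFAULT;
}

/* Make index nr valid. Returns 0 on success, -1 if allocation fails. */
static int toy_expand_fd_array(struct toy_files *f, int nr)
{
    int new_max;
    struct toy_file **new_fd;

    assert(f != NULL && nr >= 0);
    if (nr < f->max_fds)
        return 0;                                    /* already large enough */
    new_max = f->max_fds;
    while (new_max <= nr)
        new_max *= 2;                                /* assumed growth policy */
    new_fd = calloc((size_t)new_max, sizeof *new_fd);
    if (new_fd == NULL)
        return -1;
    memcpy(new_fd, f->fd, (size_t)f->max_fds * sizeof *new_fd);
    if (f->fd != f->fd_array)
        free(f->fd);                                 /* never free the embedded array */
    f->fd = new_fd;
    f->max_fds = new_max;
    return 0;
}
```

Existing descriptors keep their indices across the expansion, which is what allows a process to continue using the descriptors it already holds.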

For every file with an entry in the fd array, the array index is the file descriptor. Usually, the first element (index 0) of the array is associated with the standard input of the process, the second with the standard output, and the third with the standard error (see Figure 12-3). Unix processes use the file descriptor as the main file identifier. Notice that, thanks to the dup( ), dup2( ), and fcntl( ) system calls, two file descriptors may refer to the same opened file; that is, two elements of the array could point to the same file object. Users see this all the time when they use shell constructs like 2>&1 to redirect the standard error to the standard output.

A process cannot use more than NR_OPEN (usually, 1,048,576) file descriptors. The kernel also enforces a dynamic bound on the maximum number of file descriptors in the rlim[RLIMIT_NOFILE] structure of the process descriptor; this value is usually 1,024, but it can be raised if the process has root privileges.

The open_fds field initially contains the address of the open_fds_init field, which is a bitmap that identifies the file descriptors of currently opened files. The max_fdset field stores the number of bits in the bitmap. Since the fd_set data structure includes 1,024 bits, there is usually no need to expand the size of the bitmap. However, the kernel may dynamically expand the size of the bitmap if this turns out to be necessary, much as in the case of the array of file objects.

Figure 12-3. The fd array
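A bitmap of this kind makes it cheap to find the lowest unused descriptor. The sketch below is a user-space illustration under stated assumptions: toy_fd_set and the toy_* helpers are hypothetical names, the bitmap is fixed at 1,024 bits, and a simple linear scan stands in for whatever search the kernel actually performs.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define TOY_FDSET_BITS    1024
#define TOY_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define TOY_FDSET_LONGS   (TOY_FDSET_BITS / TOY_BITS_PER_LONG)

struct toy_fd_set { unsigned long bits[TOY_FDSET_LONGS]; };

/* Mark descriptor fd as open. */
static void toy_fd_set_bit(struct toy_fd_set *s, int fd)
{
    assert(s != NULL && fd >= 0 && fd < TOY_FDSET_BITS);
    s->bits[fd / TOY_BITS_PER_LONG] |= 1UL << (fd % TOY_BITS_PER_LONG);
}

/* Mark descriptor fd as closed. */
static void toy_fd_clear_bit(struct toy_fd_set *s, int fd)
{
    assert(s != NULL && fd >= 0 && fd < TOY_FDSET_BITS);
    s->bits[fd / TOY_BITS_PER_LONG] &= ~(1UL << (fd % TOY_BITS_PER_LONG));
}

static int toy_fd_isset(const struct toy_fd_set *s, int fd)
{
    assert(s != NULL && fd >= 0 && fd < TOY_FDSET_BITS);
    return (int)((s->bits[fd / TOY_BITS_PER_LONG] >> (fd % TOY_BITS_PER_LONG)) & 1UL);
}

/* Return the lowest descriptor whose bit is clear, or -1 if all are in use. */
static int toy_find_free_fd(const struct toy_fd_set *s)
{
    int fd;

    for (fd = 0; fd < TOY_FDSET_BITS; fd++)
        if (!toy_fd_isset(s, fd))
            return fd;
    return -1;
}
```

Returning the lowest free index matches the familiar Unix behavior in which a newly opened file receives the smallest descriptor not currently in use.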

The kernel provides an fget( ) function to be invoked when the kernel starts using a file object. This function receives as its parameter a file descriptor fd. It returns the address in current->files->fd[fd] (that is, the address of the corresponding file object), or NULL if no file corresponds to fd. In the first case, fget( ) increments the file object usage counter f_count by 1.

The kernel also provides an fput( ) function to be invoked when a kernel control path finishes using a file object. This function receives as its parameter the address of a file object and decrements its usage counter, f_count. Moreover, if this field becomes 0, the function invokes the release method of the file operations (if defined), releases the associated dentry object and filesystem descriptor, decrements the i_writecount field in the inode object (if the file was opened for writing), and finally moves the file object from the "in use" list to the "unused" one.
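The reference counting performed by this pair of functions can be sketched in a few lines. The following is a user-space illustration only: toy_fget( ) and toy_fput( ) take the descriptor table explicitly instead of using current, ignore locking, and replace all the teardown work with a single released flag, which is a hypothetical stand-in introduced for the example.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_FDS 32

struct toy_file {
    int f_count;   /* usage counter */
    int released;  /* assumption: set when the last reference is dropped */
};

struct toy_files {
    struct toy_file *fd[TOY_MAX_FDS];
};

/* Return the file object for descriptor fd and take a reference, or NULL. */
static struct toy_file *toy_fget(struct toy_files *files, int fd)
{
    struct toy_file *file;

    assert(files != NULL);
    if (fd < 0 || fd >= TOY_MAX_FDS)
        return NULL;
    file = files->fd[fd];
    if (file != NULL)
        file->f_count++;
    return file;
}

/* Drop a reference; the last one triggers the (simulated) teardown. */
static void toy_fput(struct toy_file *file)
{
    assert(file != NULL && file->f_count > 0);
    if (--file->f_count == 0)
        file->released = 1;  /* stands in for release method, dentry drop, ... */
}
```

In the test for this sketch, descriptors 1 and 2 point to the same file object, mimicking the effect of a 2>&1 redirection: both indices yield the same address and share one usage counter.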


12.3 Filesystem Types

The Linux kernel supports many different types of filesystems. In the following, we introduce a few special types of filesystems that play an important role in the internal design of the Linux kernel.

Next, we shall discuss filesystem registration, that is, the basic operation that must be performed, usually during system initialization, before using a filesystem type. Once a filesystem is registered, its specific functions are available to the kernel, so that type of filesystem can be mounted on the system's directory tree.

12.3.1 Special Filesystems

While network and disk-based filesystems enable the user to handle information stored outside the kernel, special filesystems may provide an easy way for system programs and administrators to manipulate the data structures of the kernel and to implement special features of the operating system. Table 12-8 lists the most common special filesystems used in Linux; for each of them, the table reports its mount point and a short description.

Notice that a few filesystems have no fixed mount point (keyword "any" in the table). These filesystems can be freely mounted and used by the users. Moreover, some other special filesystems do not have a mount point at all (keyword "none" in the table). They are not for user interaction, but the kernel can use them to easily reuse some of the VFS layer code; for instance, we'll see in Chapter 19 that, thanks to the pipefs special filesystem, pipes can be treated in the same way as FIFO files.

Table 12-8. Most common special filesystems

Name          Mount point   Description
bdev          none          Block devices (see Chapter 13)
binfmt_misc   any           Miscellaneous executable formats (see Chapter 20)
devfs         /dev          Virtual device files (see Chapter 13)
devpts        /dev/pts      Pseudoterminal support (Open Group's Unix98 standard)
pipefs        none          Pipes (see Chapter 19)
proc          /proc         General access point to kernel data structures
rootfs        none          Provides an empty root directory for the bootstrap phase
shm           none          IPC-shared memory regions (see Chapter 19)
sockfs        none          Sockets (see Chapter 18)
tmpfs         any           Temporary files (kept in RAM unless swapped)

Special filesystems are not bound to physical block devices. However, the kernel assigns to each mounted special filesystem a fictitious block device that has the value 0 as major number and an arbitrary value (different for each special filesystem) as a minor number. The get_unnamed_dev( ) function returns a new fictitious block device identifier, while the put_unnamed_dev( ) function releases it. The unnamed_dev_in_use array contains a mask of 256 bits that record which minor numbers are currently in use. Although some kernel designers dislike the fictitious block device identifiers, they help the kernel to handle special filesystems and regular ones in a uniform way.

We'll see a practical example of how the kernel defines and initializes a special filesystem in Section 12.4.1.
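The allocation of fictitious device identifiers out of a 256-bit mask can be sketched as follows. This is a user-space illustration, not the kernel source: the toy_* names are hypothetical, the identifier is assumed to be encoded as (major << 8) | minor, and minor 0 is deliberately skipped so that a return value of 0 can signal that no minor number is left.

```c
#include <assert.h>
#include <limits.h>

#define TOY_UNNAMED_MAJOR 0u
#define TOY_MINORS        256u
#define TOY_BPL           (sizeof(unsigned long) * CHAR_BIT)

/* One bit per minor number; a set bit means "in use". */
static unsigned long toy_unnamed_dev_in_use[TOY_MINORS / TOY_BPL];

/* Return a new fictitious device identifier, or 0 if none is available. */
static unsigned int toy_get_unnamed_dev(void)
{
    unsigned int i;

    for (i = 1; i < TOY_MINORS; i++) {
        unsigned long mask = 1UL << (i % TOY_BPL);

        if (!(toy_unnamed_dev_in_use[i / TOY_BPL] & mask)) {
            toy_unnamed_dev_in_use[i / TOY_BPL] |= mask;
            return (TOY_UNNAMED_MAJOR << 8) | i;
        }
    }
    return 0;
}

/* Release an identifier previously returned by toy_get_unnamed_dev(). */
static void toy_put_unnamed_dev(unsigned int dev)
{
    unsigned int minor = dev & 0xffu;

    assert((dev >> 8) == TOY_UNNAMED_MAJOR && minor != 0);
    toy_unnamed_dev_in_use[minor / TOY_BPL] &= ~(1UL << (minor % TOY_BPL));
}
```

A released minor number becomes available again, so the next allocation may return the same identifier.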

12.3.2 Filesystem Type Registration

Often, the user configures Linux to recognize all the filesystems needed when compiling the kernel for her system. But the code for a filesystem actually may either be included in the kernel image or dynamically loaded as a module (see Appendix B). The VFS must keep track of all filesystem types whose code is currently included in the kernel. It does this by performing filesystem type registration.

Each registered filesystem is represented as a file_system_type object whose fields are illustrated in Table 12-9.

Table 12-9. The fields of the file_system_type object

Type                          Field        Description
const char *                  name         Filesystem name
int                           fs_flags     Filesystem type flags
struct super_block *(*)( )    read_super   Method for reading superblock
struct module *               owner        Pointer to the module implementing the filesystem (see Appendix B)
struct file_system_type *     next         Pointer to the next list element
struct list_head              fs_supers    Head of a list of superblock objects

All filesystem-type objects are inserted into a singly linked list. The file_systems variable points to the first item, while the next field of the structure points to the next item in the list. The file_systems_lock read/write spin lock protects the whole list against concurrent accesses.

The fs_supers field represents the head (first dummy element) of a list of superblock objects corresponding to mounted filesystems of the given type. The backward and forward links of a list element are stored in the s_instances field of the superblock object.

The read_super field points to the filesystem-type-dependent function that reads the superblock from the disk device and copies it into the corresponding superblock object.

The fs_flags field stores several flags, which are listed in Table 12-10.

Table 12-10. The filesystem type flags

Name              Description
FS_REQUIRES_DEV   Any filesystem of this type must be located on a physical disk device.
FS_NO_DCACHE      No longer used.
FS_NO_PRELIM      No longer used.
FS_SINGLE         There can be only one superblock object for this filesystem type.
FS_NOMOUNT        Filesystem has no mount point (see Section 12.3.1).
FS_LITTER         Purge dentry cache after unmounting (for special filesystems).
FS_ODD_RENAME     "Rename" operations are "move" operations (for network filesystems).

During system initialization, the register_filesystem( ) function is invoked for every filesystem specified at compile time; the function inserts the corresponding file_system_type object into the filesystem-type list.

The register_filesystem( ) function is also invoked when a module implementing a filesystem is loaded. In this case, the filesystem may also be unregistered (by invoking the unregister_filesystem( ) function) when the module is unloaded.

The get_fs_type( ) function, which receives a filesystem name as its parameter, scans the list of registered filesystems looking at the name field of their descriptors, and returns a pointer to the corresponding file_system_type object, if it is present.
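The registration list lends itself to a compact sketch. The code below is a user-space illustration under stated assumptions: the toy_* names are hypothetical, the descriptor keeps only three fields, the file_systems_lock locking is omitted, and rejecting a second registration under the same name is a choice made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct toy_fs_type {
    const char *name;
    int fs_flags;
    struct toy_fs_type *next;  /* next element of the singly linked list */
};

static struct toy_fs_type *toy_file_systems;  /* first item of the list */

/* Return the link that points to the named type, or the final NULL link. */
static struct toy_fs_type **toy_find_filesystem(const char *name)
{
    struct toy_fs_type **p;

    for (p = &toy_file_systems; *p != NULL; p = &(*p)->next)
        if (strcmp((*p)->name, name) == 0)
            break;
    return p;
}

/* Append fs to the list. Returns 0 on success, -1 if it cannot be added. */
static int toy_register_filesystem(struct toy_fs_type *fs)
{
    struct toy_fs_type **p;

    assert(fs != NULL && fs->name != NULL);
    if (fs->next != NULL)
        return -1;             /* looks like it is already linked somewhere */
    p = toy_find_filesystem(fs->name);
    if (*p != NULL)
        return -1;             /* a type with this name is already registered */
    *p = fs;
    return 0;
}

/* Unlink fs from the list. Returns 0 on success, -1 if it was not found. */
static int toy_unregister_filesystem(struct toy_fs_type *fs)
{
    struct toy_fs_type **p;

    for (p = &toy_file_systems; *p != NULL; p = &(*p)->next) {
        if (*p == fs) {
            *p = fs->next;
            fs->next = NULL;
            return 0;
        }
    }
    return -1;
}

/* Look up a registered type by name; NULL if it is not present. */
static struct toy_fs_type *toy_get_fs_type(const char *name)
{
    return *toy_find_filesystem(name);
}
```

Walking the list through a pointer to the link, rather than a pointer to the element, lets the same helper serve both lookup and insertion at the tail.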


12.4 Filesystem Mounting

Each filesystem has its own root directory. The filesystem whose root directory is the root of the system's directory tree is called root filesystem. Other filesystems can be mounted on the system's directory tree; the directories on which they are inserted are called mount points.

A mounted filesystem is the child of the mounted filesystem to which the mount point directory belongs. For instance, the /proc virtual filesystem is a child of the root filesystem (and the root filesystem is the parent of /proc).

In most traditional Unix-like kernels, each filesystem can be mounted only once. Suppose that an Ext2 filesystem stored in the /dev/fd0 floppy disk is mounted on /flp by issuing the command:

    mount -t ext2 /dev/fd0 /flp

Until the filesystem is unmounted by issuing a umount command, any other mount command acting on /dev/fd0 fails.

However, Linux 2.4 is different: it is possible to mount the same filesystem several times. For instance, issuing the following command right after the previous one will likely succeed in Linux:

    mount -t ext2 -o ro /dev/fd0 /flp-ro

As a result, the Ext2 filesystem stored in the floppy disk is mounted both on /flp and on /flp-ro; therefore, its files can be accessed through both /flp and /flp-ro (in this example, accesses through /flp-ro are read-only).

Of course, if a filesystem is mounted n times, its root directory can be accessed through n mount points, one per mount operation. Although the same filesystem can be accessed by several paths, it is really unique. Thus, there is just one superblock object for all of them, no matter how many times it has been mounted.

Mounted filesystems form a hierarchy: the mount point of a filesystem might be a directory of a second filesystem, which in turn is already mounted over a third filesystem, and so on.[6]

[6] Quite surprisingly, the mount point of a filesystem might be a directory of the same filesystem, provided that it was already mounted before. For instance:

    mount -t ext2 /dev/fd0 /flp; touch /flp/foo
    mkdir /flp/mnt; mount -t ext2 /dev/fd0 /flp/mnt

Now, the empty foo file on the floppy filesystem can be accessed both as /flp/foo and /flp/mnt/foo.

It is also possible to stack multiple mounts on a single mount point. Each new mount on the same mount point hides the previously mounted filesystem, although processes already using the files and directories under the old mount can continue to do so. When the topmost mounting is removed, then the next lower mount is once more made visible.

As you can imagine, keeping track of mounted filesystems can quickly become a nightmare. For each mount operation, the kernel must save in memory the mount point and the mount flags, as well as the relationships between the filesystem to be mounted and the other mounted filesystems. Such information is stored in data structures named mounted filesystem descriptors; each descriptor is a data structure that has type vfsmount, whose fields are shown in Table 12-11.

Table 12-11. The fields of the vfsmount data structure

Type                   Field            Description
struct list_head       mnt_hash         Pointers for the hash table list
struct vfsmount *      mnt_parent       Points to the parent filesystem on which this filesystem is mounted
struct dentry *        mnt_mountpoint   Points to the dentry of the mount directory of this filesystem
struct dentry *        mnt_root         Points to the dentry of the root directory of this filesystem
struct super_block *   mnt_sb           Points to the superblock object of this filesystem
struct list_head       mnt_mounts       Head of the parent list of descriptors (relative to this filesystem)
struct list_head       mnt_child        Pointers for the parent list of descriptors (relative to the parent filesystem)
atomic_t               mnt_count        Usage counter
int                    mnt_flags        Flags
char *                 mnt_devname      Device file name
struct list_head       mnt_list         Pointers for global list of descriptors

The vfsmount data structures are kept in several doubly linked circular lists:

● A circular doubly linked "global" list including the descriptors of all mounted filesystems. The head of the list is a first dummy element, which is represented by the vfsmntlist variable. The mnt_list field of the descriptor contains the pointers to adjacent elements in the list.
● A hash table indexed by the address of the vfsmount descriptor of the parent filesystem and the address of the dentry object of the mount point directory. The hash table is stored in the mount_hashtable array, whose size depends on the amount of RAM in the system. Each item of the table is the head of a circular doubly linked list storing all descriptors that have the same hash value. The mnt_hash field of the descriptor contains the pointers to adjacent elements in this list.
● For each mounted filesystem, a circular doubly linked list including all child mounted filesystems. The head of each list is stored in the mnt_mounts field of the mounted filesystem descriptor; moreover, the mnt_child field of the descriptor stores the pointers to the adjacent elements in the list.

The mount_sem semaphore protects the lists of mounted filesystem objects from concurrent accesses.

The mnt_flags field of the descriptor stores the value of several flags that specify how some kinds of files in the mounted filesystem are handled. The flags are listed in Table 12-12.

Table 12-12. Mounted filesystem flags

Name         Description
MNT_NOSUID   Forbid setuid and setgid flags in the mounted filesystem
MNT_NODEV    Forbid access to device files in the mounted filesystem
MNT_NOEXEC   Disallow program execution in the mounted filesystem
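Since the three flags are independent bits of a single integer, each policy check reduces to a bit test on mnt_flags. The sketch below is illustrative only: the TOY_MNT_* constants use assumed bit values chosen for the example, and toy_vfsmount and the three helper functions are hypothetical names rather than kernel interfaces.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed bit values, for illustration only. */
#define TOY_MNT_NOSUID 1
#define TOY_MNT_NODEV  2
#define TOY_MNT_NOEXEC 4

struct toy_vfsmount { int mnt_flags; };

/* Nonzero if programs stored in this mounted filesystem may be executed. */
static int toy_may_exec(const struct toy_vfsmount *mnt)
{
    assert(mnt != NULL);
    return !(mnt->mnt_flags & TOY_MNT_NOEXEC);
}

/* Nonzero if setuid and setgid bits of files here should be honored. */
static int toy_honor_setuid(const struct toy_vfsmount *mnt)
{
    assert(mnt != NULL);
    return !(mnt->mnt_flags & TOY_MNT_NOSUID);
}

/* Nonzero if device files in this mounted filesystem may be opened. */
static int toy_may_open_device(const struct toy_vfsmount *mnt)
{
    assert(mnt != NULL);
    return !(mnt->mnt_flags & TOY_MNT_NODEV);
}
```

Because the flags live in the mounted filesystem descriptor rather than in the superblock, two mounts of the same filesystem can apply different policies.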

The following functions handle the mounted filesystem descriptors:

alloc_vfsmnt( )
Allocates and initializes a mounted filesystem descriptor

free_vfsmnt(mnt)
Frees a mounted filesystem descriptor pointed to by mnt

lookup_mnt(parent, mountpoint)
Looks up a descriptor in the hash table and returns its address
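A lookup keyed by the pair (parent descriptor, mount point dentry) can be sketched as follows. This is a user-space illustration under stated assumptions: the table has a fixed size of 16 buckets, the hash mixing is invented for the example, and toy_attach_mnt( ), toy_lookup_mnt( ), and the other toy_* names are hypothetical. Inserting at the front of each bucket is also a choice of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MNT_HASH_SIZE 16u  /* assumption: fixed, power-of-two size */

struct list_head { struct list_head *next, *prev; };
struct toy_dentry { const char *name; };

struct toy_vfsmount {
    struct list_head mnt_hash;          /* kept first so a list pointer can be cast back */
    struct toy_vfsmount *mnt_parent;
    struct toy_dentry *mnt_mountpoint;
};

static struct list_head toy_mount_hashtable[TOY_MNT_HASH_SIZE];

/* Every bucket starts as an empty circular list (head points to itself). */
static void toy_mnt_table_init(void)
{
    unsigned int i;

    for (i = 0; i < TOY_MNT_HASH_SIZE; i++)
        toy_mount_hashtable[i].next = toy_mount_hashtable[i].prev =
            &toy_mount_hashtable[i];
}

static unsigned int toy_mnt_hash(const struct toy_vfsmount *parent,
                                 const struct toy_dentry *mountpoint)
{
    uintptr_t h = (uintptr_t)parent / 64 + (uintptr_t)mountpoint / 64;

    h ^= h >> 4;
    return (unsigned int)(h & (TOY_MNT_HASH_SIZE - 1));
}

/* Record that mnt is mounted on mountpoint, a directory of parent. */
static void toy_attach_mnt(struct toy_vfsmount *mnt, struct toy_vfsmount *parent,
                           struct toy_dentry *mountpoint)
{
    struct list_head *head;

    assert(mnt != NULL);
    head = &toy_mount_hashtable[toy_mnt_hash(parent, mountpoint)];
    mnt->mnt_parent = parent;
    mnt->mnt_mountpoint = mountpoint;
    mnt->mnt_hash.next = head->next;
    mnt->mnt_hash.prev = head;
    head->next->prev = &mnt->mnt_hash;
    head->next = &mnt->mnt_hash;
}

/* Return the descriptor mounted on (parent, mountpoint), or NULL. */
static struct toy_vfsmount *toy_lookup_mnt(struct toy_vfsmount *parent,
                                           struct toy_dentry *mountpoint)
{
    struct list_head *head = &toy_mount_hashtable[toy_mnt_hash(parent, mountpoint)];
    struct list_head *p;

    for (p = head->next; p != head; p = p->next) {
        struct toy_vfsmount *m = (struct toy_vfsmount *)p;  /* mnt_hash is first */

        if (m->mnt_parent == parent && m->mnt_mountpoint == mountpoint)
            return m;
    }
    return NULL;
}
```

Both key components are compared after the bucket is selected, because unrelated mounts may hash to the same bucket.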

12.4.1 Mounting the Root Filesystem

Mounting the root filesystem is a crucial part of system initialization. It is a fairly complex procedure because the Linux kernel allows the root filesystem to be stored in many different places, such as a hard disk partition, a floppy disk, a remote filesystem shared via NFS, or even a fictitious block device kept in RAM.

To keep the description simple, let's assume that the root filesystem is stored in a partition of a hard disk (the most common case, after all). While the system boots, the kernel finds the major number of the disk that contains the root filesystem in the ROOT_DEV variable. The root filesystem can be specified as a device file in the /dev directory either when compiling the kernel or by passing a suitable "root" option to the initial bootstrap loader. Similarly, the mount flags of the root filesystem are stored in the root_mountflags variable. The user specifies these flags either by using the rdev external program on a compiled kernel image or by passing a suitable rootflags option to the initial bootstrap loader (see Appendix A).

Mounting the root filesystem is a two-stage procedure, shown in the following list.

1. The kernel mounts the special rootfs filesystem, which just provides an empty directory that serves as initial mount point.
2. The kernel mounts the real root filesystem over the empty directory.

Why does the kernel bother to mount the rootfs filesystem before the real one? Well, the rootfs filesystem allows the kernel to easily change the real root filesystem. In fact, in some cases, the kernel mounts and unmounts several root filesystems, one after the other. For instance, the initial bootstrap floppy disk of a distribution might load in RAM a kernel with a minimal set of drivers, which mounts as root a minimal filesystem stored in a RAM disk. Next, the programs in this initial root filesystem probe the hardware of the system (for instance, they determine whether the hard disk is EIDE, SCSI, or whatever), load all needed kernel modules, and remount the root filesystem from a physical block device.

The first stage is performed by the init_mount_tree( ) function, which is executed during system initialization:

    struct file_system_type root_fs_type;
    root_fs_type.name = "rootfs";
    root_fs_type.read_super = rootfs_read_super;
    root_fs_type.fs_flags = FS_NOMOUNT;
    register_filesystem(&root_fs_type);
    root_vfsmnt = do_kern_mount("rootfs", 0, "rootfs", NULL);

The root_fs_type variable stores the descriptor object of the rootfs special filesystem; its fields are initialized, and then it is passed to the register_filesystem( ) function (see the earlier Section 12.3.2). The do_kern_mount( ) function mounts the special filesystem and returns the address of a new mounted filesystem object; this address is saved by init_mount_tree( ) in the root_vfsmnt variable. From now on, root_vfsmnt represents the root of the tree of the mounted filesystems.

The do_kern_mount( ) function receives the following parameters:

type
    The type of filesystem to be mounted

flags
    The mount flags (see Table 12-13 in the later Section 12.4.2)

name
    The device filename of the block device storing the filesystem (or the filesystem type name for special filesystems)

data
    Pointers to additional data to be passed to the read_super method of the filesystem

The function takes care of the actual mount operation by performing the following operations:

1. Checks whether the current process has the privileges for the mount operation (the check always succeeds when the function is invoked by init_mount_tree( ) because the system initialization is carried out by a process owned by root).
2. Invokes get_fs_type( ) to search in the list of filesystem types and locate the name stored in the type parameter; get_fs_type( ) returns the address of the corresponding file_system_type descriptor.
3. Invokes alloc_vfsmnt( ) to allocate a new mounted filesystem descriptor and stores its address in the mnt local variable.
4. Initializes the mnt->mnt_devname field with the content of the name parameter.
5. Allocates a new superblock and initializes it. do_kern_mount( ) checks the flags in the file_system_type descriptor to determine how to do this:
   a. If FS_REQUIRES_DEV is on, invokes get_sb_bdev( ) (see the later Section 12.4.2)
   b. If FS_SINGLE is on, invokes get_sb_single( ) (see the later Section 12.4.2)
   c. Otherwise, invokes get_sb_nodev( )
6. If the FS_NOMOUNT flag in the file_system_type descriptor is on, sets the MS_NOUSER flag in the superblock object.
7. Initializes the mnt->mnt_sb field with the address of the new superblock object.
8. Initializes the mnt->mnt_root and mnt->mnt_mountpoint fields with the address of the dentry object corresponding to the root directory of the filesystem.
9. Initializes the mnt->mnt_parent field with the value in mnt (the newly mounted filesystem has no parent).
10. Releases the s_umount semaphore of the superblock object (it was acquired when the object was allocated in Step 5).
11. Returns the address mnt of the mounted filesystem object.
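The lookup of Step 2 and the flag-driven choice of Step 5 can be sketched in user space as follows. This is a simplified model rather than kernel code: the toy_ names, the flag values, and the return codes are hypothetical and were chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative flag values; the real kernel constants may differ. */
#define TOY_FS_REQUIRES_DEV 1
#define TOY_FS_NOMOUNT      2
#define TOY_FS_SINGLE       4

/* Which superblock allocator the mount code would pick (Step 5). */
enum toy_get_sb { TOY_GET_SB_BDEV, TOY_GET_SB_SINGLE, TOY_GET_SB_NODEV };

/* Hypothetical stand-in for the file_system_type descriptor. */
struct toy_fs_type {
    const char *name;
    int fs_flags;
    struct toy_fs_type *next;   /* simply linked list of registered types */
};

static struct toy_fs_type *toy_file_systems;

/* Appends fs to the list; returns -1 if the name is already registered. */
int toy_register_filesystem(struct toy_fs_type *fs)
{
    struct toy_fs_type **p;

    assert(fs != NULL && fs->name != NULL);
    for (p = &toy_file_systems; *p != NULL; p = &(*p)->next)
        if (strcmp((*p)->name, fs->name) == 0)
            return -1;
    fs->next = NULL;
    *p = fs;
    return 0;
}

/* Step 2: locate a registered type by name, or return NULL. */
struct toy_fs_type *toy_get_fs_type(const char *name)
{
    struct toy_fs_type *p;

    for (p = toy_file_systems; p != NULL; p = p->next)
        if (strcmp(p->name, name) == 0)
            return p;
    return NULL;
}

/* Step 5: the flags of the type decide how the superblock is obtained. */
enum toy_get_sb toy_choose_get_sb(const struct toy_fs_type *fs)
{
    if (fs->fs_flags & TOY_FS_REQUIRES_DEV)
        return TOY_GET_SB_BDEV;
    if (fs->fs_flags & TOY_FS_SINGLE)
        return TOY_GET_SB_SINGLE;
    return TOY_GET_SB_NODEV;
}
```

In this model, a type registered with only TOY_FS_NOMOUNT set (as rootfs is in the text) falls through to the "nodev" case, while a disk-based type takes the block device path.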

When the do_kern_mount( ) function is invoked by init_mount_tree( ) to mount the rootfs special filesystem, neither the FS_REQUIRES_DEV flag nor the FS_SINGLE flag is set, so the function uses get_sb_nodev( ) to allocate the superblock object. This function executes the following steps:

1. Invokes get_unnamed_dev( ) to allocate a new fictitious block device identifier (see the earlier Section 12.3.1).
2. Invokes the read_super( ) function, passing to it the filesystem type object, the mount flags, and the fictitious block device identifier. In turn, this function performs the following actions:
   a. Allocates a new superblock object and puts its address in the local variable s.
   b. Initializes the s->s_dev field with the block device identifier.
   c. Initializes the s->s_flags field with the mount flags (see Table 12-13).
   d. Initializes the s->s_type field with the filesystem type descriptor of the filesystem.
   e. Acquires the sb_lock spin lock.
   f. Inserts the superblock in the global circular list whose head is super_blocks.
   g. Inserts the superblock in the filesystem type list whose head is s->s_type->fs_supers.
   h. Releases the sb_lock spin lock.
   i. Acquires for writing the s->s_umount read/write semaphore.
   j. Acquires the s->s_lock semaphore.
   k. Invokes the read_super method of the filesystem type.
   l. Sets the MS_ACTIVE flag in s->s_flags.
   m. Releases the s->s_lock semaphore.
   n. Returns the address s of the superblock.

3. If the filesystem type is implemented by a kernel module, increments its usage counter.
4. Returns the address of the new superblock.

The second stage of the mount operation for the root filesystem is performed by the mount_root( ) function near the end of the system initialization. For the sake of brevity, we consider the case of a disk-based filesystem whose device files are handled in the traditional way (we briefly discuss in Chapter 13 how the devfs virtual filesystem offers an alternative way to handle device files). In this case, the function performs the following operations:

1. Allocates a buffer and fills it with a list of filesystem type names. This list is either passed to the kernel in the rootfstype boot parameter or is built by scanning the elements in the simply linked list of filesystem types.
2. Invokes the bdget( ) and blkdev_get( ) functions to check whether the ROOT_DEV root device exists and is properly working.
3. Invokes get_super( ) to search for a superblock object associated with the ROOT_DEV device in the super_blocks list. Usually none is found because the root filesystem is still to be mounted. The check is made, however, because it is possible to remount a previously mounted filesystem. Usually the root filesystem is mounted twice during the system boot: the first time as a read-only filesystem so that its integrity can be safely checked; the second time for reading and writing so that normal operations can start. We'll suppose that no superblock object associated with the ROOT_DEV device is found in the super_blocks list.
4. Scans the list of filesystem type names built in Step 1. For each name, invokes get_fs_type( ) to get the corresponding file_system_type object, and invokes read_super( ) to attempt to read the corresponding superblock from disk. As described earlier, this function allocates a new superblock object and attempts to fill it by using the method to which the read_super field of the file_system_type object points. Since each filesystem-specific method uses unique magic numbers, all read_super( ) invocations will fail except the one that attempts to fill the superblock by using the method of the filesystem really used on the root device. The read_super( ) method also creates an inode object and a dentry object for the root directory; the dentry object maps to the inode object.
5. Allocates a new mounted filesystem object and initializes its fields with the ROOT_DEV block device name, the address of the superblock object, and the address of the dentry object of the root directory.
6. Invokes the graft_tree( ) function, which inserts the new mounted filesystem object in the children list of root_vfsmnt, in the global list of mounted filesystem objects, and in the mount_hashtable hash table.
7. Sets the root and pwd fields of the fs_struct table of current (the init process) to the dentry object of the root directory.
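The probing loop of Step 4 in mount_root( ) can be illustrated with a small user-space sketch. It assumes a made-up on-disk layout in which the magic number occupies the first two bytes of the device, stored in little-endian order; real filesystems keep their magic numbers at filesystem-specific offsets. All toy_ names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced superblock: only what the sketch needs. */
struct toy_super {
    const char *fs_name;
    uint16_t magic;
};

/* Hypothetical filesystem type: a name plus the magic number it expects. */
struct toy_fs_type {
    const char *name;
    uint16_t magic;
};

/*
 * Plays the role of a filesystem-specific read_super method: it succeeds
 * (returns 0) only if the device carries the magic number of this type.
 */
static int toy_read_super(const struct toy_fs_type *fs,
                          const uint8_t *disk, size_t len,
                          struct toy_super *sb)
{
    uint16_t magic;

    if (len < 2)
        return -1;
    magic = (uint16_t)(disk[0] | (disk[1] << 8));
    if (magic != fs->magic)
        return -1;
    sb->fs_name = fs->name;
    sb->magic = magic;
    return 0;
}

/*
 * Tries every candidate type in order, as Step 4 does, and returns the
 * first one whose read_super succeeds, or NULL if none recognizes the disk.
 */
const struct toy_fs_type *toy_mount_root(const struct toy_fs_type *types,
                                         size_t ntypes,
                                         const uint8_t *disk, size_t len,
                                         struct toy_super *sb)
{
    size_t i;

    assert(types != NULL && sb != NULL);
    for (i = 0; i < ntypes; i++)
        if (toy_read_super(&types[i], disk, len, sb) == 0)
            return &types[i];
    return NULL;
}
```

Because each type checks for its own magic number, at most one candidate is expected to accept a given device, which is why the order of the list matters only for speed in this model.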

12.4.2 Mounting a Generic Filesystem

Once the root filesystem is initialized, additional filesystems may be mounted. Each must have its own mount point, which is just an already existing directory in the system's directory tree.

The mount( ) system call is used to mount a filesystem; its sys_mount( ) service routine acts on the following parameters:

● The pathname of a device file containing the filesystem, or NULL if it is not required (for instance, when the filesystem to be mounted is network-based)
● The pathname of the directory on which the filesystem will be mounted (the mount point)
● The filesystem type, which must be the name of a registered filesystem
● The mount flags (permitted values are listed in Table 12-13)
● A pointer to a filesystem-dependent data structure (which may be NULL)


Table 12-13. Mount flags

Macro             Description
MS_RDONLY         Files can only be read
MS_NOSUID         Forbid setuid and setgid flags
MS_NODEV          Forbid access to device files
MS_NOEXEC         Disallow program execution
MS_SYNCHRONOUS    Write operations are immediate
MS_REMOUNT        Remount the filesystem changing the mount flags
MS_MANDLOCK       Mandatory locking allowed
MS_NOATIME        Do not update file access time
MS_NODIRATIME     Do not update directory access time
MS_BIND           Create a "bind mount," which allows making a file or directory visible at another point of the system directory tree
MS_MOVE           Atomically move a mounted filesystem on another mount point
MS_REC            Should recursively create "bind mounts" for a directory subtree (still unfinished in 2.4.18)
MS_VERBOSE        Generate kernel messages on mount errors
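As an illustration of how a user-space program might assemble the mount flags of Table 12-13, the following sketch converts a comma-separated option string into a flag mask. The MS_ constants come from the <sys/mount.h> header of a Linux C library; the function name, the option table, and the 128-byte limit are assumptions made for this example, with option names borrowed from the conventional mount command options.

```c
#include <assert.h>
#include <string.h>
#include <sys/mount.h>

/* Hypothetical mapping from option names to mount flags. */
struct toy_opt_flag {
    const char *name;
    unsigned long flag;
};

static const struct toy_opt_flag toy_opt_table[] = {
    { "ro",         MS_RDONLY },
    { "nosuid",     MS_NOSUID },
    { "nodev",      MS_NODEV },
    { "noexec",     MS_NOEXEC },
    { "sync",       MS_SYNCHRONOUS },
    { "mand",       MS_MANDLOCK },
    { "noatime",    MS_NOATIME },
    { "nodiratime", MS_NODIRATIME },
};

/*
 * Parses a comma-separated option list such as "ro,noexec".
 * Returns 0 and stores the mask in *flags on success; returns -1 if an
 * option is unknown or the string does not fit the local buffer.
 */
int toy_parse_mount_options(const char *opts, unsigned long *flags)
{
    char buf[128];
    char *tok;
    size_t i;
    const size_t n = sizeof toy_opt_table / sizeof toy_opt_table[0];
    unsigned long mask = 0;

    assert(opts != NULL && flags != NULL);
    if (strlen(opts) >= sizeof buf)
        return -1;
    strcpy(buf, opts);              /* strtok() modifies its argument */

    for (tok = strtok(buf, ","); tok != NULL; tok = strtok(NULL, ",")) {
        for (i = 0; i < n; i++)
            if (strcmp(tok, toy_opt_table[i].name) == 0)
                break;
        if (i == n)
            return -1;              /* unknown option */
        mask |= toy_opt_table[i].flag;
    }
    *flags = mask;
    return 0;
}
```

The resulting mask could then be passed as the flags argument of the mount( ) system call, which requires the appropriate privileges.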

The sys_mount( ) function copies the value of the parameters into temporary kernel buffers, acquires the big kernel lock, and invokes the do_mount( ) function. Once do_mount( ) returns, the service routine releases the big kernel lock and frees the temporary kernel buffers.

The do_mount( ) function takes care of the actual mount operation by performing the following operations:

1. Checks whether the sixteen highest-order bits of the mount flags are set to the "magic" value 0xc0ed; in this case, they are cleared. This is a legacy hack that allows the sys_mount( ) service routine to be used with old C libraries that do not handle the highest-order flags.
2. If any of the MS_NOSUID, MS_NODEV, or MS_NOEXEC flags passed as a parameter are set, clears them and sets the corresponding flag (MNT_NOSUID, MNT_NODEV, MNT_NOEXEC) in the mounted filesystem object.
3. Looks up the pathname of the mount point by invoking path_init( ) and path_walk( ) (see the later Section 12.5).
4. Examines the mount flags to determine what has to be done. In particular:
   a. If the MS_REMOUNT flag is specified, the purpose is usually to change the mount flags in the s_flags field of the superblock object and the mounted filesystem flags in the mnt_flags field of the mounted filesystem object. The do_remount( ) function performs these changes.
   b. Otherwise, checks the MS_BIND flag. If it is specified, the user is asking to make visible a file or directory on another point of the system directory tree. Usually, this is done when mounting a filesystem stored in a regular file instead of a physical disk partition (loopback). The do_loopback( ) function accomplishes this task.
   c. Otherwise, checks the MS_MOVE flag. If it is specified, the user is asking to change the mount point of an already mounted filesystem. The do_move_mount( ) function does this atomically.
   d. Otherwise, invokes do_add_mount( ). This is the most common case. It is triggered when the user asks to mount either a special filesystem or a regular filesystem stored in a disk partition. do_add_mount( ) performs the following actions:
      a. Invokes do_kern_mount( ), passing to it the filesystem type, the mount flags, and the block device name. As already described in Section 12.4.1, do_kern_mount( ) takes care of the actual mount operation.
      b. Acquires the mount_sem semaphore.
      c. Initializes the flags in the mnt_flags field of the new mounted filesystem object allocated by do_kern_mount( ).
      d. Invokes graft_tree( ) to insert the new mounted filesystem object in the global list, in the hash table, and in the children list of the parent-mounted filesystem.
      e. Releases the mount_sem semaphore.

5. Invokes path_release( ) to terminate the pathname lookup of the mount point (see the later Section 12.5).

The core of the mount operation is the do_kern_mount( ) function, which we already described in the earlier Section 12.4.1. Recall that this function checks the filesystem type flags to determine how the mount operation is to be done. For a regular disk-based filesystem, the FS_REQUIRES_DEV flag is set, so do_kern_mount( ) invokes the get_sb_bdev( ) function, which performs the following actions:

1. Invokes path_init( ) and path_walk( ) to look up the pathname of the mount point (see Section 12.5).
2. Invokes blkdev_get( ) to open the block device storing the regular filesystem.
3. Searches the list of superblock objects; if a superblock relative to the block device is already present, returns its address. This means that the filesystem is already mounted and will be mounted again.
4. Otherwise, allocates a new superblock object, initializes its s_dev, s_bdev, s_flags, and s_type fields, and inserts it into the global list of superblocks and the superblock list of the filesystem type descriptor.
5. Acquires the s_lock semaphore of the superblock.
6. Invokes the read_super method of the filesystem type to access the superblock information on disk and fill the other fields of the new superblock object.
7. Sets the MS_ACTIVE flag of the superblock.
8. Releases the s_lock semaphore of the superblock.
9. If the filesystem type is implemented by a kernel module, increments its usage counter.
10. Invokes path_release( ) to terminate the mount point lookup operation.
11. Returns the address of the new superblock object.
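The flag handling that do_mount( ) performs before any superblock is touched (Steps 1, 2, and 4 of its description earlier in this section) can be sketched as a pure function. This is a user-space model: the MS_ constants are taken from the Linux <sys/mount.h> header, while the TOY_ constants, the structure, and the function are hypothetical. The magic value is assumed to be 0xC0ED in the upper sixteen bits, as in the Linux headers.

```c
#include <assert.h>
#include <sys/mount.h>

/* Assumed legacy magic value and mask for the sixteen highest-order bits. */
#define TOY_MGC_VAL 0xC0ED0000UL
#define TOY_MGC_MSK 0xFFFF0000UL

/* Illustrative per-mount flag values (hypothetical names). */
#define TOY_MNT_NOSUID 1
#define TOY_MNT_NODEV  2
#define TOY_MNT_NOEXEC 4

/* Which helper the mount code would call in Step 4. */
enum toy_mount_action { TOY_REMOUNT, TOY_LOOPBACK, TOY_MOVE, TOY_ADD };

struct toy_mount_request {
    unsigned long sb_flags;         /* flags left for the superblock */
    int mnt_flags;                  /* flags for the mounted filesystem */
    enum toy_mount_action action;
};

struct toy_mount_request toy_do_mount(unsigned long flags)
{
    struct toy_mount_request req;

    /* Step 1: discard the legacy magic value, if present. */
    if ((flags & TOY_MGC_MSK) == TOY_MGC_VAL)
        flags &= ~TOY_MGC_MSK;

    /* Step 2: move the per-mount flags out of the superblock flags. */
    req.mnt_flags = 0;
    if (flags & MS_NOSUID)
        req.mnt_flags |= TOY_MNT_NOSUID;
    if (flags & MS_NODEV)
        req.mnt_flags |= TOY_MNT_NODEV;
    if (flags & MS_NOEXEC)
        req.mnt_flags |= TOY_MNT_NOEXEC;
    flags &= ~(unsigned long)(MS_NOSUID | MS_NODEV | MS_NOEXEC);

    /* Step 4: the first matching flag wins, in this order. */
    if (flags & MS_REMOUNT)
        req.action = TOY_REMOUNT;
    else if (flags & MS_BIND)
        req.action = TOY_LOOPBACK;
    else if (flags & MS_MOVE)
        req.action = TOY_MOVE;
    else
        req.action = TOY_ADD;

    req.sb_flags = flags;
    return req;
}
```

The ordering of the tests matters: a request that sets both MS_REMOUNT and MS_BIND is treated as a remount in this model, mirroring the "otherwise" chain of Step 4.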

12.4.3 Unmounting a Filesystem

The umount( ) system call is used to unmount a filesystem. The corresponding sys_umount( ) service routine acts on two parameters: a filename (either a mount point directory or a block device filename) and a set of flags. It performs the following actions:

1. Invokes path_init( ) and path_walk( ) to look up the mount point pathname (see the next section). Once finished, the functions return the address d of the dentry object corresponding to the pathname.
2. If the resulting directory is not the mount point of a filesystem, returns the -EINVAL error code. This check is done by verifying that d->mnt->mnt_root contains the address of the dentry object d.
3. If the filesystem to be unmounted has not been mounted on the system directory tree, returns the -EINVAL error code. (Recall that some special filesystems have no mount point.) This check is done by invoking the check_mnt( ) function on d->mnt.
4. If the user does not have the privileges required to unmount the filesystem, returns the -EPERM error code.
5. Invokes do_umount( ), which performs the following operations:
   a. Retrieves the address of the superblock object from the mnt_sb field of the mounted filesystem object.
   b. If the user asked to force the unmount operation, interrupts any ongoing mount operation by invoking the umount_begin superblock operation.
   c. If the filesystem to be unmounted is the root filesystem and the user didn't ask to actually detach it, invokes do_remount_sb( ) to remount the root filesystem read-only and terminates.
   d. Acquires the mount_sem semaphore for writing and the dcache_lock dentry spin lock.
   e. If the mounted filesystem does not include mount points for any child mounted filesystem, or if the user asked to forcibly detach the filesystem, invokes umount_tree( ) to unmount the filesystem (together with all children).
   f. Releases mount_sem and dcache_lock.
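The sequence of checks made by sys_umount( ) and do_umount( ) can be summarized by the following user-space sketch. The structures and the toy_umount( ) function are hypothetical simplifications. In particular, the -EBUSY result for a filesystem that still has child mounts is an assumption of this sketch, since the text does not say which value is returned in that case.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, reduced dentry: identity is all the sketch needs. */
struct toy_dentry {
    const char *name;
};

/* Hypothetical, reduced mounted filesystem object. */
struct toy_vfsmount {
    struct toy_dentry *mnt_root; /* root directory of the filesystem */
    int attached;                /* nonzero if in the system directory tree */
    int nchildren;               /* child filesystems mounted inside */
    int mounted;                 /* cleared when the filesystem is unmounted */
};

/*
 * d is the dentry obtained from the pathname lookup; detach is nonzero
 * when the user asked to forcibly detach the filesystem.
 */
int toy_umount(struct toy_vfsmount *mnt, struct toy_dentry *d,
               int privileged, int detach)
{
    assert(mnt != NULL && d != NULL);

    if (mnt->mnt_root != d)      /* Step 2: not a mount point */
        return -EINVAL;
    if (!mnt->attached)          /* Step 3: not in the directory tree */
        return -EINVAL;
    if (!privileged)             /* Step 4: insufficient privileges */
        return -EPERM;

    /* Step 5.e: unmount only if no children exist, or if forced. */
    if (mnt->nchildren == 0 || detach) {
        mnt->mounted = 0;
        return 0;
    }
    return -EBUSY;               /* assumed error for a busy filesystem */
}
```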


12.5 Pathname Lookup

In this section, we illustrate how the VFS derives an inode from the corresponding file pathname. When a process must identify a file, it passes its file pathname to some VFS system call, such as open( ), mkdir( ), rename( ), or stat( ).

The standard procedure for performing this task consists of analyzing the pathname and breaking it into a sequence of filenames. All filenames except the last must identify directories.

If the first character of the pathname is /, the pathname is absolute, and the search starts from the directory identified by current->fs->root (the process root directory). Otherwise, the pathname is relative and the search starts from the directory identified by current->fs->pwd (the process-current directory).

Having in hand the inode of the initial directory, the code examines the entry matching the first name to derive the corresponding inode. Then the directory file that has that inode is read from disk and the entry matching the second name is examined to derive the corresponding inode. This procedure is repeated for each name included in the path.

The dentry cache considerably speeds up the procedure, since it keeps the most recently used dentry objects in memory. As we saw before, each such object associates a filename in a specific directory to its corresponding inode. In many cases, therefore, the analysis of the pathname can avoid reading the intermediate directories from the disk.

However, things are not as simple as they look, since the following Unix and VFS filesystem features must be taken into consideration:

● The access rights of each directory must be checked to verify whether the process is allowed to read the directory's content.
● A filename can be a symbolic link that corresponds to an arbitrary pathname; in this case, the analysis must be extended to all components of that pathname.
● Symbolic links may induce circular references; the kernel must take this possibility into account and break endless loops when they occur.
● A filename can be the mount point of a mounted filesystem. This situation must be detected, and the lookup operation must continue into the new filesystem.
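The first part of the procedure, breaking a pathname into a sequence of filenames and deciding whether the search starts from the root or from the current directory, can be sketched in user space as follows. The toy_ names and the fixed limits are assumptions of this example; permission checks, symbolic links, and mount points are deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_COMPONENTS 16   /* arbitrary limits for the sketch */
#define TOY_NAME_MAX       64

/* Result of splitting a pathname (hypothetical structure). */
struct toy_path {
    int absolute;               /* 1 if the pathname starts with '/' */
    size_t count;               /* number of components found */
    char names[TOY_MAX_COMPONENTS][TOY_NAME_MAX];
};

/*
 * Splits path into its components, treating any run of slashes as a
 * single separator. Returns 0 on success, or -1 if a component is too
 * long or there are too many components.
 */
int toy_split_path(const char *path, struct toy_path *out)
{
    const char *p = path;

    assert(path != NULL && out != NULL);
    out->absolute = (*p == '/');
    out->count = 0;

    for (;;) {
        size_t len;

        while (*p == '/')       /* skip leading or repeated slashes */
            p++;
        if (*p == '\0')
            break;
        len = strcspn(p, "/");  /* length of the next component */
        if (out->count == TOY_MAX_COMPONENTS || len >= TOY_NAME_MAX)
            return -1;
        memcpy(out->names[out->count], p, len);
        out->names[out->count][len] = '\0';
        out->count++;
        p += len;
    }
    return 0;
}
```

In this model, every component except the last would have to name a directory; the absolute field tells the caller which starting directory applies.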

Pathname lookup is performed by three functions: path_init( ), path_walk( ), and path_release( ). They are always invoked in this exact order.

The path_init( ) function receives three parameters:

name
    A pointer to the file pathname to be resolved.

flags
    The value of flags that represent how the looked-up file is going to be accessed. The flags are listed in Table 12-17 in the later Section 12.6.1. [7]

    [7] There is, however, a small difference in how the O_RDONLY, O_WRONLY, and O_RDWR flags are encoded. The bit at index 0 (lowest-order) of the flags parameter is set only if the file access requires read privileges; similarly, the bit at index 1 is set only if the file access requires write privileges. Conversely, for the open( ) system call, the value of the O_WRONLY flag is stored in the bit at index 0, while the O_RDWR flag is stored in the bit at index 1; thus, the O_RDONLY flag is true when both bits are cleared. Notice that it is not possible to specify in the open( ) system call that a file access does not require either read or write privileges; this makes sense, however, in a pathname lookup operation involving symbolic links.

nd
    The address of a struct nameidata data structure.

The struct nameidata data structure is filled by path_walk( ) with data pertaining to the pathname lookup operation. The fields of this structure are shown in Table 12-14.

Table 12-14. The fields of the nameidata data structure

Type                 Field        Description
struct dentry *      dentry       Address of the dentry object
struct vfsmount *    mnt          Address of the mounted filesystem object
struct qstr          last         Last component of the pathname (used when the LOOKUP_PARENT flag is set)
unsigned int         flags        Lookup flags
int                  last_type    Type of last component of the pathname (used when the LOOKUP_PARENT flag is set)

The dentry and mnt fields point respectively to the dentry object and the mounted filesystem object of the last resolved component in the pathname. Once path_walk( ) successfully returns, these two fields "describe" the file that is identified by the given pathname.

The flags field stores the value of some flags used in the lookup operation; they are listed in Table 12-15.

Table 12-15. The flags of the lookup operation

Macro               Description
LOOKUP_FOLLOW       If the last component is a symbolic link, interpret (follow) it.
LOOKUP_DIRECTORY    The last component must be a directory.
LOOKUP_CONTINUE     There are still filenames to be examined in the pathname (used only by NFS).
LOOKUP_POSITIVE     The pathname must identify an existing file.
LOOKUP_PARENT       Look up the directory including the last component of the pathname.
LOOKUP_NOALT        Do not consider the emulated root directory (always set for the 80 x 86 architecture).
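The footnote to the flags parameter of path_init( ) describes two encodings of the access mode: the one used by open( ), and the one used in a lookup, where bit 0 means "read" and bit 1 means "write." A minimal sketch of the conversion follows; the function name and the TOY_ constants are hypothetical, while O_ACCMODE, O_RDONLY, O_WRONLY, and O_RDWR are the standard <fcntl.h> macros.

```c
#include <assert.h>
#include <fcntl.h>

/* Lookup-style encoding described in the footnote (hypothetical names). */
#define TOY_ACC_READ  1   /* bit 0: the access requires read privileges */
#define TOY_ACC_WRITE 2   /* bit 1: the access requires write privileges */

/*
 * Converts the access mode of open()-style flags into the lookup-style
 * encoding. Returns -1 if the access mode is not one of the three
 * standard values.
 */
int toy_open_to_lookup_access(int open_flags)
{
    switch (open_flags & O_ACCMODE) {
    case O_RDONLY:
        return TOY_ACC_READ;
    case O_WRONLY:
        return TOY_ACC_WRITE;
    case O_RDWR:
        return TOY_ACC_READ | TOY_ACC_WRITE;
    default:
        return -1;
    }
}
```

On Linux, where O_RDONLY, O_WRONLY, and O_RDWR have the values 0, 1, and 2, this mapping is equivalent to adding 1 to the access mode. The value 0 in the lookup encoding, which open( ) cannot express, stands for an access that needs neither read nor write privileges.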

The goal of the path_init( ) function consists of initializing the nameidata structure, which it does in the following manner:

1. Sets the dentry field with the address of the dentry object of the directory where the pathname lookup operation starts. If the pathname is relative (it doesn't start with a slash), the field points to the dentry of the working directory (current->fs->pwd); otherwise it points to the dentry of the root directory of the process (current->fs->root).
2. Sets the mnt field with the address of the mounted filesystem object relative to the directory where the pathname lookup operation starts: either current->fs->pwdmnt or current->fs->rootmnt, according to whether the pathname is relative or absolute.
3. Initializes the flags field with the value of the flags parameter.
4. Initializes the last_type field to LAST_ROOT.

Once path_init( ) initializes the nameidata data structure, the path_walk( ) function takes care of the lookup operation, and stores in the nameidata structure the pointers to the dentry object and mounted filesystem object relative to the last component of the pathname. The function also increments the usage counters of the objects referenced by nd->dentry and nd->mnt so that the caller function may safely access them once path_walk( ) returns. When the caller finishes accessing them, it invokes the third function of the set, path_release( ), which receives as a parameter the address of the nameidata data structure and decrements the two usage counters of nd->dentry and nd->mnt.

We are now ready to describe the core of the pathname lookup operation, namely the path_walk( ) function. It receives as parameters a pointer name to the pathname to be resolved and the address nd of the nameidata data structure. The function initializes to zero the total_link_count of the current process (see the later Section 12.5.3), and then invokes link_path_walk( ). This latter function acts on the same two parameters of path_walk( ).
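The pairing of initialization and release described above can be modeled with a few lines of user-space C. The sketch below is a simplification built on stated assumptions: all toy_ names are hypothetical, the value of TOY_LAST_ROOT is arbitrary, the walk itself is omitted, and the references are taken when the structure is initialized so that the sketch stays balanced without a walk step.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_LAST_ROOT 2   /* arbitrary value standing in for LAST_ROOT */

struct toy_dentry   { int d_count; };    /* usage counter only */
struct toy_vfsmount { int mnt_count; };  /* usage counter only */

/* Stand-in for the fs_struct fields used by the lookup. */
struct toy_fs {
    struct toy_dentry *root, *pwd;
    struct toy_vfsmount *rootmnt, *pwdmnt;
};

/* Stand-in for struct nameidata. */
struct toy_nameidata {
    struct toy_dentry *dentry;
    struct toy_vfsmount *mnt;
    unsigned int flags;
    int last_type;
};

/* Chooses the starting directory and takes a reference on each object. */
void toy_path_init(const char *name, unsigned int flags,
                   struct toy_nameidata *nd, const struct toy_fs *fs)
{
    assert(name != NULL && nd != NULL && fs != NULL);
    if (name[0] == '/') {        /* absolute: start from the root */
        nd->dentry = fs->root;
        nd->mnt = fs->rootmnt;
    } else {                     /* relative: start from the cwd */
        nd->dentry = fs->pwd;
        nd->mnt = fs->pwdmnt;
    }
    nd->dentry->d_count++;
    nd->mnt->mnt_count++;
    nd->flags = flags;
    nd->last_type = TOY_LAST_ROOT;
}

/* Drops the two references held through nd. */
void toy_path_release(struct toy_nameidata *nd)
{
    assert(nd != NULL);
    nd->dentry->d_count--;
    nd->mnt->mnt_count--;
}
```

The point of the pattern is that every successful lookup leaves the caller holding one reference on each of the two objects, and exactly one matching release gives them back.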

To make things a bit easier, we first describe what link_path_walk( ) does when LOOKUP_PARENT is not set and the pathname does not contain symbolic links (standard pathname lookup). Next, we discuss the case in which LOOKUP_PARENT is set: this type of lookup is required when creating, deleting, or renaming a directory entry, that is, during a parent pathname lookup. Finally, we explain how the function resolves the symbolic links.

12.5.1 Standard Pathname Lookup

When the LOOKUP_PARENT flag is cleared, link_path_walk( ) performs the following steps.

1. Initializes the lookup_flags local variable with nd->flags.

2. Skips any leading slash (/) before the first component of the pathname.

3. If the remaining pathname is empty, returns the value 0. In the nameidata data structure, the dentry and mnt fields point to the object relative to the last resolved component of the original pathname.

4. If the link_count field in the descriptor of the current process is positive, sets the LOOKUP_FOLLOW flag in the lookup_flags local variable (see Section 12.5.3).

5. Executes a cycle that breaks name into components (the intermediate slashes are treated as filename separators); for each component found, the function:

   a. Retrieves the address of the inode object of the last resolved component from nd->dentry->d_inode.

   b. Checks that the permissions of the last resolved component stored in the inode allow execution (in Unix, a directory can be traversed only if it is executable). If the inode has a custom permission method, the function executes it; otherwise, it executes the vfs_permission( ) function, which examines the access mode stored in the i_mode inode field and the privileges of the running process.

   c. Considers the next component to be resolved. From its name, it computes a hash value for the dentry cache hash table.

   d. Skips any trailing slash (/) after the slash that terminates the name of the component to be resolved.

   e. If the component to be resolved is the last one in the original pathname, jumps to Step 6.

   f. If the name of the component is "." (a single dot), continues with the next component ("." refers to the current directory, so it has no effect inside a pathname).

   g. If the name of the component is ".." (two dots), tries to climb to the parent directory:

      1. If the last resolved directory is the process's root directory (nd->dentry is equal to current->fs->root and nd->mnt is equal to current->fs->rootmnt), continues with the next component.

      2. If the last resolved directory is the root directory of a mounted filesystem (nd->dentry is equal to nd->mnt->mnt_root), sets nd->mnt to nd->mnt->mnt_parent and nd->dentry to nd->mnt->mnt_mountpoint, and then restarts Step 5.g. (Recall that several filesystems can be mounted on the same mount point.)

      3. If the last resolved directory is not the root directory of a mounted filesystem, sets nd->dentry to nd->dentry->d_parent and continues with the next component.

   h. The component name is neither "." nor "..", so the function must look it up in the dentry cache. If the low-level filesystem has a custom d_hash dentry method, the function invokes it to modify the hash value already computed in Step 5.c.

   i. Invokes cached_lookup( ), passing as parameters nd->dentry, the name of the component to be resolved, the hash value, and the LOOKUP_CONTINUE flag, which specifies that this is not the last component of the pathname. The function invokes d_lookup( ) to search for the dentry object of the component in the dentry cache. If cached_lookup( ) fails to find the dentry in the cache, link_path_walk( ) invokes real_lookup( ) to read the directory from disk and create a new dentry object. In either case, we can assume at the end of this step that the dentry local variable points to the dentry object of the component name to be resolved in this cycle.

   j. Checks whether the component just resolved (dentry local variable) refers to a directory that is a mount point for some filesystem (dentry->d_mounted is set to 1). In this case, invokes lookup_mnt( ), passing to it dentry and nd->mnt, in order to get the address mounted of the child mounted filesystem object. Next, it sets dentry to mounted->mnt_root and nd->mnt to mounted. Then it repeats the whole step (several filesystems can be mounted on the same mount point).

   k. Checks whether the inode object dentry->d_inode has a custom follow_link method. If this is the case, the component is a symbolic link, which is described in the later section Section 12.5.3.

   l. Checks that dentry points to the dentry object of a directory (dentry->d_inode->i_op->lookup method is defined). If not, returns the error -ENOTDIR, because the component is in the middle of the original pathname.

   m. Sets nd->dentry to dentry and continues with the next component of the pathname.

6. Now all components of the original pathname are resolved except the last one. If the pathname has a trailing slash, it sets the LOOKUP_FOLLOW and LOOKUP_DIRECTORY flags in the lookup_flags local variable to force interpretation of the last component as a directory name.

7. Checks the value of the LOOKUP_PARENT flag in the lookup_flags variable. In the following, we assume that the flag is set to 0, and we postpone the opposite case to the next section.

8. If the name of the last component is "." (a single dot), terminates the execution returning the value 0 (no error). In the nameidata structure that nd points to, the dentry and mnt fields refer to the objects relative to the next-to-last component of the pathname (any component "." has no effect inside a pathname).

9. If the name of the last component is ".." (two dots), tries to climb to the parent directory:

   a. If the last resolved directory is the process's root directory (nd->dentry is equal to current->fs->root and nd->mnt is equal to current->fs->rootmnt), terminates the execution returning the value 0 (no error). nd->dentry and nd->mnt refer to the objects relative to the next-to-last component of the pathname, that is, to the root directory of the process.

   b. If the last resolved directory is the root directory of a mounted filesystem (nd->dentry is equal to nd->mnt->mnt_root), sets nd->mnt to nd->mnt->mnt_parent and nd->dentry to nd->mnt->mnt_mountpoint, and then restarts Step 9.

   c. If the last resolved directory is not the root directory of a mounted filesystem, sets nd->dentry to nd->dentry->d_parent, and terminates the execution returning the value 0 (no error). nd->dentry and nd->mnt refer to the objects relative to the next-to-last component of the pathname.

10. The name of the last component is neither "." nor "..", so the function must look it up in the dentry cache. If the low-level filesystem has a custom d_hash dentry method, the function invokes it to modify the hash value already computed in Step 5.c.

11. Invokes cached_lookup( ), passing as parameters nd->dentry, the name of the component to be resolved, the hash value, and no flag (LOOKUP_CONTINUE is not set because this is the last component of the pathname). If cached_lookup( ) fails to find the dentry in the cache, the function also invokes real_lookup( ) to read the directory from disk and create a new dentry object. In either case, we can assume at the end of this step that the dentry local variable points to the dentry object of the component name to be resolved in this cycle.

12. Checks whether the component just resolved (dentry local variable) refers to a directory that is a mount point for some filesystem (dentry->d_mounted is set to 1). In this case, invokes lookup_mnt( ), passing to it dentry and nd->mnt, in order to get the address mounted of the child mounted filesystem object. Next, it sets dentry to mounted->mnt_root and nd->mnt to mounted. Then it repeats the whole step (because several filesystems can be mounted on the same mount point).

13. Checks whether the LOOKUP_FOLLOW flag is set in lookup_flags and the inode object dentry->d_inode has a custom follow_link method. If this is the case, the component is a symbolic link that must be interpreted, as described in the later section Section 12.5.3.

14. Sets nd->dentry with the value stored in the dentry local variable. This dentry object is "the result" of the lookup operation.

15. Checks whether nd->dentry->d_inode is NULL. This happens when there is no inode associated with the dentry object, usually because the pathname refers to a nonexisting file. In this case:

   a. If either LOOKUP_POSITIVE or LOOKUP_DIRECTORY is set in lookup_flags, it terminates, returning the error code -ENOENT.

   b. Otherwise, it terminates returning the value 0 (no error). nd->dentry points to the negative dentry object created by the lookup operation.

16. There is an inode associated with the last component of the pathname. If the LOOKUP_DIRECTORY flag is set in lookup_flags, checks that the inode has a custom lookup method, that is, it is a directory. If not, terminates returning the error code -ENOTDIR.

17. Terminates returning the value 0 (no error). nd->dentry and nd->mnt refer to the last component of the pathname.
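The component loop described in Steps 2 through 9 can be illustrated with a small user-space sketch. It is only a model under stated assumptions: it works on strings rather than dentry objects, it ignores permissions, mount points, and symbolic links, and it takes "/" as the process's root directory. The function name resolve_lexically( ) and its buffer convention are hypothetical and do not appear in the kernel.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical user-space model of the component loop: resolves "name"
 * starting from the absolute directory "start" and writes the resulting
 * absolute pathname to "out". Returns 0 on success, -1 on bad input or
 * if the result does not fit in "outsz" bytes.
 */
int resolve_lexically(const char *start, const char *name,
                      char *out, size_t outsz)
{
    size_t len;

    if (*name == '/')               /* absolute pathname: restart from root */
        start = "/";
    len = strlen(start);
    if (len == 0 || start[0] != '/' || len + 1 > outsz)
        return -1;
    memcpy(out, start, len + 1);
    while (len > 1 && out[len - 1] == '/')   /* normalize "/foo/" to "/foo" */
        out[--len] = '\0';

    for (;;) {
        const char *end;
        size_t clen;

        while (*name == '/')        /* skip leading and repeated slashes */
            name++;
        if (*name == '\0')
            return 0;               /* no component left */
        end = strchr(name, '/');
        clen = end ? (size_t)(end - name) : strlen(name);

        if (clen == 1 && name[0] == '.') {
            /* "." has no effect inside a pathname */
        } else if (clen == 2 && name[0] == '.' && name[1] == '.') {
            /* "..": climb to the parent, unless already at the root */
            if (len > 1) {
                while (len > 1 && out[len - 1] != '/')
                    len--;
                if (len > 1)        /* drop the separator, but keep "/" */
                    len--;
                out[len] = '\0';
            }
        } else {
            /* ordinary component: append it to the resolved prefix */
            size_t need = len + (len > 1 ? 1 : 0) + clen + 1;
            if (need > outsz)
                return -1;
            if (len > 1)
                out[len++] = '/';
            memcpy(out + len, name, clen);
            len += clen;
            out[len] = '\0';
        }
        name += clen;
    }
}
```

In this model, resolving "/foo/./bar/../baz/" from "/" yields "/foo/baz", and "../.." from "/" stays at "/", mirroring the special treatment of the process's root directory in Steps 5.g and 9.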

12.5.2 Parent Pathname Lookup

In many cases, the real target of a lookup operation is not the last component of the pathname, but the next-to-last one. For example, when a file is created, the last component denotes the filename of the not yet existing file, and the rest of the pathname specifies the directory in which the new link must be inserted. Therefore, the lookup operation should fetch the dentry object of the next-to-last component. For another example, unlinking a file identified by the pathname /foo/bar consists of removing bar from the directory foo. Thus, the kernel is really interested in accessing the directory foo rather than bar.

The LOOKUP_PARENT flag is used whenever the lookup operation must resolve the directory containing the last component of the pathname, rather than the last component itself.

When the LOOKUP_PARENT flag is set, the path_walk( ) function also sets up the last and last_type fields of the nameidata data structure. The last field stores the name of the last component in the pathname. The last_type field identifies the type of the last component; it may be set to one of the values shown in Table 12-16.

Table 12-16. The values of the last_type field in the nameidata data structure

Value          Description
LAST_NORM      Last component is a regular filename
LAST_ROOT      Last component is "/" (that is, the entire pathname is "/")
LAST_DOT       Last component is "."
LAST_DOTDOT    Last component is ".."
LAST_BIND      Last component is a symbolic link into a special filesystem

The LAST_ROOT flag is the default value set by path_init( ) when the whole pathname lookup operation starts (see the description at the beginning of Section 12.5). If the pathname turns out to be just "/", the kernel does not change the initial value of the last_type field. The LAST_BIND flag is set by the follow_link inode object's method of symbolic links in special filesystems (see the next section).

The remaining values of the last_type field are set by link_path_walk( ) when the LOOKUP_PARENT flag is on; in this case, the function performs the same steps described in the previous section up to Step 7. From Step 7 onward, however, the lookup operation for the last component of the pathname is different:

1. Sets nd->last to the name of the last component.

2. Initializes nd->last_type to LAST_NORM.

3. If the name of the last component is "." (a single dot), sets nd->last_type to LAST_DOT.

4. If the name of the last component is ".." (two dots), sets nd->last_type to LAST_DOTDOT.

5. Terminates by returning the value 0 (no error).

As you can see, the last component is not interpreted at all. Thus, when the function terminates, the dentry and mnt fields of the nameidata data structure point to the objects relative to the directory that includes the last component.
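The classification of the last component can be sketched in user space as follows. This is a minimal illustration, not kernel code: the enumerator names come from Table 12-16, but their numeric values, the classify_last( ) helper, and its treatment of trailing slashes are assumptions made for the example. LAST_BIND is never returned here, because it is set by a follow_link method rather than by the classification itself.

```c
#include <stddef.h>
#include <string.h>

/* Names follow Table 12-16; the numeric values are illustrative only. */
enum last_type { LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND };

/*
 * Hypothetical helper: stores in *last a pointer to the last component of
 * "path" and returns its type. The last component itself is not resolved.
 */
enum last_type classify_last(const char *path, const char **last)
{
    const char *end = path + strlen(path);
    const char *p;
    size_t len;

    while (end > path && end[-1] == '/')    /* ignore trailing slashes */
        end--;
    if (end == path) {                      /* nothing but slashes */
        *last = path;
        return LAST_ROOT;                   /* the default value is kept */
    }
    p = end;
    while (p > path && p[-1] != '/')        /* back up to the component start */
        p--;
    *last = p;
    len = (size_t)(end - p);

    if (len == 1 && p[0] == '.')
        return LAST_DOT;
    if (len == 2 && p[0] == '.' && p[1] == '.')
        return LAST_DOTDOT;
    return LAST_NORM;
}
```

A caller that creates or unlinks a file would use the returned type to reject "." and ".." and would look up the name stored in *last inside the parent directory.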

12.5.3 Lookup of Symbolic Links

Recall that a symbolic link is a regular file that stores a pathname of another file. A pathname may include symbolic links, and they must be resolved by the kernel.

For example, if /foo/bar is a symbolic link pointing to (containing the pathname) ../dir, the pathname /foo/bar/file must be resolved by the kernel as a reference to the file /dir/file. In this example, the kernel must perform two different lookup operations. The first one resolves /foo/bar; when the kernel discovers that bar is the name of a symbolic link, it must retrieve its content and interpret it as another pathname. The second pathname operation starts from the directory reached by the first operation and continues until the last component of the symbolic link pathname has been resolved. Next, the original lookup operation resumes from the dentry reached in the second one and with the component following the symbolic link in the original pathname.

To further complicate the scenario, the pathname included in a symbolic link may include other symbolic links. You might think that the kernel code that resolves the symbolic links is hard to understand, but this is not true; the code is actually quite simple because it is recursive.

However, untamed recursion is intrinsically dangerous. For instance, suppose that a symbolic link points to itself. Of course, resolving a pathname including such a symbolic link may induce an endless stream of recursive invocations, which in turn quickly leads to a kernel stack overflow. The link_count field in the descriptor of the current process is used to avoid the problem: the field is incremented before each recursive execution and decremented right after. If the field reaches the value 5, the whole lookup operation terminates with an error code. Therefore, the level of nesting of symbolic links can be at most 5.

Furthermore, the total_link_count field in the descriptor of the current process keeps track of how many symbolic links (even non-nested) were followed in the original lookup operation. If this counter reaches the value 40, the lookup operation aborts. Without this counter, a malicious user could create a pathological pathname including many consecutive symbolic links that freezes the kernel in a very long lookup operation.

This is how the code basically works: once the link_path_walk( ) function retrieves the dentry object associated with a component of the pathname, it checks whether the corresponding inode object has a custom follow_link method (see Step 5.k and Step 13 in Section 12.5.1). If so, the inode is a symbolic link that must be interpreted before proceeding with the lookup operation of the original pathname.

In this case, the link_path_walk( ) function invokes do_follow_link( ), passing to it the address of the dentry object of the symbolic link and the address of the nameidata data structure. In turn, do_follow_link( ) performs the following steps:

1. Checks that current->link_count is less than 5; otherwise, returns the error code -ELOOP.

2. Checks that current->total_link_count is less than 40; otherwise, returns the error code -ELOOP.

3. If the current->need_resched flag is set, invokes schedule( ) to give a chance to preempt the running process.

4. Increments current->link_count and current->total_link_count.

5. Updates the access time of the inode object associated with the symbolic link to be resolved.

6. Invokes the follow_link method of the inode, passing to it the addresses of the dentry object and of the nameidata data structure.

7. Decrements the current->link_count field.

8. Returns the error code returned by the follow_link method (0 for no error).

The follow_link method is a filesystem-dependent function that reads the pathname stored in the symbolic link from the disk. Having filled a buffer with the symbolic link's pathname, most follow_link methods end up invoking the vfs_follow_link( ) function and returning the value taken from it. In turn, vfs_follow_link( ) does the following:

1. Checks whether the first character of the symbolic link pathname is a slash; if so, the dentry and mnt fields of the nameidata data structure are set so they refer to the current process root directory.

2. Invokes link_path_walk( ) to resolve the symbolic link pathname, passing to it the nameidata data structure.

3. Returns the value taken from link_path_walk( ).

When do_follow_link( ) finally terminates, it returns the address of the dentry object referred to by the symbolic link to the original execution of link_path_walk( ). The link_path_walk( ) assigns this address to the dentry local variable, and then proceeds with the next step.
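The two counters can be illustrated with a toy user-space model of the recursion. Everything below is a sketch under stated assumptions: the "filesystem" is a table that maps a name to the target of a symbolic link, pathnames consist of a single component, and struct toy_task, struct toy_link, toy_readlink( ), and toy_walk( ) are hypothetical names. Only the limits 5 and 40 and the order of the checks are taken from the text.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define MAX_NESTED_LINKS 5    /* limit on link_count, as stated in the text */
#define MAX_TOTAL_LINKS  40   /* limit on total_link_count, as stated in the text */

/* Toy "filesystem": each entry maps a name to a symbolic link target. */
struct toy_link { const char *name; const char *target; };

/* Stand-in for the two counters kept in the process descriptor. */
struct toy_task { int link_count; int total_link_count; };

/* Returns the target of "name", or NULL if "name" is not a symbolic link. */
static const char *toy_readlink(const struct toy_link *tab, size_t n,
                                const char *name)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(tab[i].name, name) == 0)
            return tab[i].target;
    return NULL;
}

/*
 * Resolves "name", following symbolic links recursively. On success stores
 * the final name in *result and returns 0; returns -ELOOP when either
 * limit is hit.
 */
int toy_walk(struct toy_task *cur, const struct toy_link *tab, size_t n,
             const char *name, const char **result)
{
    const char *target = toy_readlink(tab, n, name);
    int err;

    if (target == NULL) {               /* not a link: lookup is complete */
        *result = name;
        return 0;
    }
    if (cur->link_count >= MAX_NESTED_LINKS)        /* nesting too deep */
        return -ELOOP;
    if (cur->total_link_count >= MAX_TOTAL_LINKS)   /* too many links overall */
        return -ELOOP;

    cur->link_count++;                  /* incremented before the recursion */
    cur->total_link_count++;
    err = toy_walk(cur, tab, n, target, result);
    cur->link_count--;                  /* decremented right after */
    return err;
}
```

With a link that points to itself, the recursion in this model stops with -ELOOP once link_count reaches 5, and link_count is back to 0 when the outermost call returns. A caller playing the role of path_walk( ) would reset total_link_count to 0 before each new lookup.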


12.6 Implementations of VFS System Calls

For the sake of brevity, we cannot discuss the implementation of all the VFS system calls listed in Table 12-1. However, it could be useful to sketch out the implementation of a few system calls, just to show how VFS's data structures interact.

Let's reconsider the example proposed at the beginning of this chapter: a user issues a shell command that copies the MS-DOS file /floppy/TEST to the Ext2 file /tmp/test. The command shell invokes an external program like cp, which we assume executes the following code fragment:

    inf = open("/floppy/TEST", O_RDONLY, 0);
    outf = open("/tmp/test", O_WRONLY | O_CREAT | O_TRUNC, 0600);
    do {
        len = read(inf, buf, 4096);
        write(outf, buf, len);
    } while (len);
    close(outf);
    close(inf);

Actually, the code of the real cp program is more complicated, since it must also check for possible error codes returned by each system call. In our example, we just focus our attention on the "normal" behavior of a copy operation.

12.6.1 The open( ) System Call

The open( ) system call is serviced by the sys_open( ) function, which receives as parameters the pathname filename of the file to be opened, some access mode flags flags, and a permission bit mask mode if the file must be created. If the system call succeeds, it returns a file descriptor, that is, the index assigned to the new file in the current->files->fd array of pointers to file objects; otherwise, it returns -1.

In our example, open( ) is invoked twice; the first time to open /floppy/TEST for reading (O_RDONLY flag) and the second time to open /tmp/test for writing (O_WRONLY flag). If /tmp/test does not already exist, it is created (O_CREAT flag) with exclusive read and write access for the owner (octal 0600 number in the third parameter).

Conversely, if the file already exists, it is rewritten from scratch (O_TRUNC flag). Table 12-17 lists all flags of the open( ) system call.

Table 12-17. The flags of the open( ) system call

Flag name      Description
O_RDONLY       Open for reading
O_WRONLY       Open for writing
O_RDWR         Open for both reading and writing
O_CREAT        Create the file if it does not exist
O_EXCL         With O_CREAT, fail if the file already exists
O_NOCTTY       Never consider the file as a controlling terminal
O_TRUNC        Truncate the file (remove all existing contents)
O_APPEND       Always write at end of the file
O_NONBLOCK     No system calls will block on the file
O_NDELAY       Same as O_NONBLOCK
O_SYNC         Synchronous write (block until physical write terminates)
FASYNC         Asynchronous I/O notification via signals
O_DIRECT       Direct I/O transfer (no kernel buffering)
O_LARGEFILE    Large file (size greater than 2 GB)
O_DIRECTORY    Fail if file is not a directory
O_NOFOLLOW     Do not follow a trailing symbolic link in pathname

Let's describe the operation of the sys_open( ) function. It performs the following steps:

1. Invokes getname( ) to read the file pathname from the process address space.

2. Invokes get_unused_fd( ) to find an empty slot in current->files->fd. The corresponding index (the new file descriptor) is stored in the fd local variable.

3. Invokes the filp_open( ) function, passing as parameters the pathname, the access mode flags, and the permission bit mask. This function, in turn, executes the following steps:

   a. Copies the access mode flags into namei_flags, but encodes the access mode flags O_RDONLY, O_WRONLY, and O_RDWR with the format expected by the pathname lookup functions (see the earlier section Section 12.5).

   b. Invokes open_namei( ), passing to it the pathname, the modified access mode flags, and the address of a local nameidata data structure. The function performs the lookup operation in the following manner:

      - If O_CREAT is not set in the access mode flags, starts the lookup operation with the LOOKUP_PARENT flag not set. Moreover, the LOOKUP_FOLLOW flag is set only if O_NOFOLLOW is cleared, while the LOOKUP_DIRECTORY flag is set only if the O_DIRECTORY flag is set.

      - If O_CREAT is set in the access mode flags, starts the lookup operation with the LOOKUP_PARENT flag set. Once the path_walk( ) function successfully returns, checks whether the requested file already exists. If not, allocates a new disk inode by invoking the create method of the parent inode.

      The open_namei( ) function also executes several security checks on the file located by the lookup operation. For instance, the function checks whether the inode associated with the dentry object found really exists, whether it is a regular file, and whether the current process is allowed to access it according to the access mode flags. Also, if the file is opened for writing, the function checks that the file is not locked by other processes.

   c. Invokes the dentry_open( ) function, passing to it the access mode flags and the addresses of the dentry object and the mounted filesystem object located by the lookup operation. In turn, this function:

      1. Allocates a new file object.

      2. Initializes the f_flags and f_mode fields of the file object according to the access mode flags passed to the open( ) system call.

      3. Initializes the f_dentry and f_vfsmnt fields of the file object according to the addresses of the dentry object and the mounted filesystem object passed as parameters.

      4. Sets the f_op field to the contents of the i_fop field of the corresponding inode object. This sets up all the methods for future file operations.

      5. Inserts the file object into the list of opened files pointed to by the s_files field of the filesystem's superblock.

      6. If the O_DIRECT flag is set, preallocates a direct access buffer (see Section 15.3).

      7. If the open method of the file operations is defined, invokes it.

   d. Returns the address of the file object.

4. Sets current->files->fd[fd] to the address of the file object returned by dentry_open( ).

5. Returns fd.
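The way Step 3.b derives the lookup flags from the open( ) flags can be summarized in a short sketch. It is an illustration only: the numeric values given to the LOOKUP_ constants are placeholders chosen for the example, lookup_flags_for_open( ) is a hypothetical name, and the assumption that no other lookup flag accompanies LOOKUP_PARENT in the O_CREAT case is ours, since the text does not say.

```c
#define _POSIX_C_SOURCE 200809L   /* for O_NOFOLLOW and O_DIRECTORY */
#include <fcntl.h>

/* Placeholder bit values; the kernel's own constants may differ. */
#define LOOKUP_FOLLOW    1u
#define LOOKUP_DIRECTORY 2u
#define LOOKUP_PARENT    16u

/*
 * Hypothetical helper: maps the flags passed to open() to the flags used
 * to start the pathname lookup, following the two cases of Step 3.b.
 */
unsigned int lookup_flags_for_open(int oflags)
{
    unsigned int lookup_flags = 0;

    if (oflags & O_CREAT)
        return LOOKUP_PARENT;            /* resolve the containing directory */

    if (!(oflags & O_NOFOLLOW))
        lookup_flags |= LOOKUP_FOLLOW;   /* follow a trailing symbolic link */
    if (oflags & O_DIRECTORY)
        lookup_flags |= LOOKUP_DIRECTORY; /* last component must be a directory */
    return lookup_flags;
}
```

For the two open( ) calls of the cp example, this model starts a standard lookup with LOOKUP_FOLLOW for /floppy/TEST and a parent lookup for /tmp/test.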

12.6.2 The read( ) and write( ) System Calls

Let's return to the code in our cp example. The open( ) system calls return two file descriptors, which are stored in the inf and outf variables. Then the program starts a loop: at each iteration, a portion of the /floppy/TEST file is copied into a local buffer (read( ) system call), and then the data in the local buffer is written into the /tmp/test file (write( ) system call).

The read( ) and write( ) system calls are quite similar. Both require three parameters: a file descriptor fd, the address buf of a memory area (the buffer containing the data to be transferred), and a number count that specifies how many bytes should be transferred. Of course, read( ) transfers the data from the file into the buffer, while write( ) does the opposite. Both system calls return either the number of bytes that were successfully transferred or -1 to signal an error condition.

A return value less than count does not mean that an error occurred. The kernel is always allowed to terminate the system call even if not all requested bytes were transferred, and the user application must accordingly check the return value and reissue, if necessary, the system call. Typically, a small value is returned when reading from a pipe or a terminal device, when reading past the end of the file, or when the system call is interrupted by a signal. The End-Of-File condition (EOF) can easily be recognized by a return value of 0 from read( ). This condition will not be confused with an abnormal termination due to a signal, because if read( ) is interrupted by a signal before any data is read, an error occurs.
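A copy loop that takes short transfers and interruptions into account might look like the following sketch. The helper name copy_fd( ) and its return convention are assumptions made for the example; it relies only on the standard read( ) and write( ) behavior described above.

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: copies all remaining data from descriptor "in" to
 * descriptor "out". Returns 0 at end of file, -1 on error (errno is set).
 */
int copy_fd(int in, int out)
{
    char buf[4096];

    for (;;) {
        ssize_t len = read(in, buf, sizeof buf);

        if (len == 0)
            return 0;                   /* EOF: read() returned 0 */
        if (len < 0) {
            if (errno == EINTR)
                continue;               /* interrupted before any data: retry */
            return -1;
        }
        /* write() may transfer fewer bytes than requested: reissue it */
        for (ssize_t done = 0; done < len; ) {
            ssize_t n = write(out, buf + done, (size_t)(len - done));

            if (n < 0) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            done += n;
        }
    }
}
```

Unlike the simplified fragment of the cp example, this loop never passes a negative length to write( ) and does not lose data when write( ) returns early.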

The read or write operation always takes place at the file offset specified by the current file pointer (field f_pos of the file object). Both system calls update the file pointer by adding the number of transferred bytes to it.

In short, both sys_read( ) (the read( )'s service routine) and sys_write( ) (the write( )'s service routine) perform almost the same steps:

1. Invoke fget( ) to derive from fd the address file of the corresponding file object and increment the usage counter file->f_count.

2. Check whether the flags in file->f_mode allow the requested access (read or write operation).

3. Invoke locks_verify_area( ) to check whether there are mandatory locks for the file portion to be accessed (see Section 12.7 later in this chapter).

4. Invoke either file->f_op->read or file->f_op->write to transfer the data. Both functions return the number of bytes that were actually transferred. As a side effect, the file pointer is properly updated.

5. Invoke fput( ) to decrement the usage counter file->f_count.

6. Return the number of bytes actually transferred.

12.6.3 The close( ) System Call

The loop in our example code terminates when the read( ) system call returns the value 0, that is, when all bytes of /floppy/TEST have been copied into /tmp/test. The program can then close the open files, since the copy operation has completed.

The close( ) system call receives as its parameter fd, which is the file descriptor of the file to be closed. The sys_close( ) service routine performs the following operations:

1. Gets the file object address stored in current->files->fd[fd]; if it is NULL, returns an error code.

2. Sets current->files->fd[fd] to NULL. Releases the file descriptor fd by clearing the corresponding bits in the open_fds and close_on_exec fields of current->files (see Chapter 20 for the Close on Execution flag).

3. Invokes filp_close( ), which performs the following operations:

   a. Invokes the flush method of the file operations, if defined

   b. Releases any mandatory lock on the file

   c. Invokes fput( ) to release the file object

4. Returns the error code of the flush method (usually 0).
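The bookkeeping done by sys_close( ) can be mimicked in a small user-space model. The sketch below is not kernel code: the types file_model and files_model, the constant MODEL_NR_OPEN, and the function model_close( ) are all names we assume for illustration, the two bitmaps are reduced to single words, and the flush method and lock release of filp_close( ) are left out.

```c
#include <errno.h>
#include <stddef.h>

#define MODEL_NR_OPEN 32            /* hypothetical table size for this sketch */

struct file_model {
    int f_count;                    /* usage counter, as in file->f_count */
};

struct files_model {
    struct file_model *fd[MODEL_NR_OPEN];   /* descriptor -> file object */
    unsigned long open_fds;                 /* bit n set: descriptor n in use */
    unsigned long close_on_exec;            /* bit n set: close n on exec */
};

/* Model of the steps listed for sys_close(); returns 0 or -EBADF. */
int model_close(struct files_model *files, unsigned int fd)
{
    struct file_model *filp;

    /* Step 1: fetch the file object; fail if the slot is empty. */
    if (fd >= MODEL_NR_OPEN || (filp = files->fd[fd]) == NULL)
        return -EBADF;

    /* Step 2: clear the slot and both bitmap bits. */
    files->fd[fd] = NULL;
    files->open_fds &= ~(1UL << fd);
    files->close_on_exec &= ~(1UL << fd);

    /* Step 3c: stands in for fput(), which drops one reference. */
    filp->f_count--;

    /* Step 4: this model has no flush method, so report success. */
    return 0;
}
```

Closing the same descriptor twice fails the second time in this model, because the first call has already set the slot to NULL.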


12.7 File Locking

When a file can be accessed by more than one process, a synchronization problem occurs. What happens if two processes try to write in the same file location? Or again, what happens if a process reads from a file location while another process is writing into it?

In traditional Unix systems, concurrent accesses to the same file location produce unpredictable results. However, Unix systems provide a mechanism that allows the processes to lock a file region so that concurrent accesses may be easily avoided.

The POSIX standard requires a file-locking mechanism based on the fcntl( ) system call. It is possible to lock an arbitrary region of a file (even a single byte) or to lock the whole file (including data appended in the future). Since a process can choose to lock just a part of a file, it can also hold multiple locks on different parts of the file.

This kind of lock does not keep out another process that is ignorant of locking. Like a critical region in code, the lock is considered "advisory" because it doesn't work unless other processes cooperate in checking the existence of a lock before accessing the file. Therefore, POSIX's locks are known as advisory locks.

Traditional BSD variants implement advisory locking through the flock( ) system call. This call does not allow a process to lock a file region, just the whole file. Traditional System V variants provide the lockf( ) function, which is just an interface to fcntl( ).

More importantly, System V Release 3 introduced mandatory locking: the kernel checks that every invocation of the open( ), read( ), and write( ) system calls does not violate a mandatory lock on the file being accessed. Therefore, mandatory locks are enforced even between noncooperative processes. [8] A file is marked as a candidate for mandatory locking by setting its set-group bit (SGID) and clearing the group-execute permission bit. Since the set-group bit makes no sense when the group-execute bit is off, the kernel interprets that combination as a hint to use mandatory locks instead of advisory ones.

[8] Oddly enough, a process may still unlink (delete) a file even if some other process owns a mandatory lock on it! This perplexing situation is possible because when a process deletes a file hard link, it does not modify its contents, but only the contents of its parent directory.
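The permission-bit convention that marks a file as a candidate for mandatory locking can be written as a one-line test on the file mode. This is a sketch under our own naming: is_mandatory_lock_candidate( ) is a hypothetical helper, while S_ISGID and S_IXGRP are the standard mode-bit constants from <sys/stat.h>.

```c
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical helper: nonzero if the mode has the set-group bit on
 * and the group-execute bit off, the combination described in the text. */
int is_mandatory_lock_candidate(mode_t mode)
{
    return (mode & (S_ISGID | S_IXGRP)) == S_ISGID;
}
```

For instance, a file with mode 02644 qualifies, while 02654 (group-execute on) and 0644 (set-group bit off) do not.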

Whether processes use advisory or mandatory locks, they can use both shared read locks and exclusive write locks. Any number of processes may have read locks on some file region, but only one process can have a write lock on it at the same time. Moreover, it is not possible to get a write lock when another process owns a read lock for the same file region, and vice versa (see Table 12-18).

Table 12-18. Whether a lock is granted

                   Grant request for
Current locks      Read lock?      Write lock?
No lock            Yes             Yes
Read lock          Yes             No
Write lock         No              No
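Table 12-18 is small enough to be encoded directly as a lookup table. The sketch below is only an illustration; the enumerations and the function lock_is_granted( ) are names we introduce here and do not appear in the kernel.

```c
/* Hypothetical names for the rows and columns of Table 12-18. */
enum held_lock    { HELD_NONE, HELD_READ, HELD_WRITE };
enum lock_request { REQ_READ, REQ_WRITE };

/* Returns 1 if a request of kind `req` is granted while another
 * process holds a lock of kind `held` on the same region, else 0. */
int lock_is_granted(enum held_lock held, enum lock_request req)
{
    static const int grant[3][2] = {
        /* REQ_READ  REQ_WRITE */
        {  1,        1 },       /* HELD_NONE  */
        {  1,        0 },       /* HELD_READ  */
        {  0,        0 },       /* HELD_WRITE */
    };

    return grant[held][req];
}
```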

12.7.1 Linux File Locking

Linux supports all fashions of file locking: advisory and mandatory locks, as well as fcntl( ), flock( ), and lockf( ). However, lockf( ) is just a library wrapper routine, and therefore is not discussed here.

fcntl( )'s mandatory locks can be enabled and disabled on a per-filesystem basis using the MS_MANDLOCK flag (the mand option) of the mount( ) system call. The default is to switch off mandatory locking. In this case, fcntl( ) creates advisory locks. When the flag is set, fcntl( ) produces mandatory locks if the file has the set-group bit on and the group-execute bit off; it produces advisory locks otherwise.

In earlier Linux versions, the flock( ) system call produced only advisory locks, regardless of the MS_MANDLOCK mount flag. This is the expected behavior of the system call in any Unix-like operating system. In Linux 2.4, however, a special kind of flock( )'s mandatory lock has been added to allow proper support for some proprietary network filesystem implementations. It is the so-called share-mode mandatory lock; when set, no other process may open a file that would conflict with the access mode of the lock. Use of this feature for native Unix applications is discouraged, because the resulting source code will be nonportable.

Another kind of flock( )-based mandatory lock, called a lease, has been introduced in Linux 2.4. When a process tries to open a file protected by a lease, it is blocked as usual. However, the process that owns the lock receives a signal. Once informed, it should first update the file so that its content is consistent, and then release the lock. If the owner does not do this in a well-defined time interval (tunable by writing a number of seconds into /proc/sys/fs/lease-break-time, usually 45 seconds), the lease is automatically removed by the kernel and the blocked process is allowed to continue.

Besides the checks in the read( ) and write( ) system calls, the kernel takes into consideration the existence of mandatory locks when servicing all system calls that could modify the contents of a file. For instance, an open( ) system call with the O_TRUNC flag set fails if any mandatory lock exists for the file.

A lock produced by fcntl( ) is of type FL_POSIX, while a lock produced by flock( ) is of type FL_FLOCK, FL_MAND (for share-mode locks), or FL_LEASE (for leases). The types of locks produced by fcntl( ) may safely coexist with those produced by flock( ), but neither one has any effect on the other. Therefore, a file locked through fcntl( ) does not appear locked to flock( ), and vice versa.

The following section describes the main data structure used by the kernel to handle file locks. The next two sections examine the differences between the two most common lock types: FL_POSIX and FL_FLOCK.

12.7.2 File-Locking Data Structures

The file_lock data structure represents file locks; its fields are shown in Table 12-19. All file_lock data structures are included in a doubly linked list. The address of the first element is stored in file_lock_list, while the fl_link field stores the pointers to the adjacent elements in the list.

Table 12-19. The fields of the file_lock data structure

Type                            Field        Description
struct file_lock *              fl_next      Next element in inode list
struct list_head                fl_link      Pointers for global list
struct list_head                fl_block     Pointers for process list
struct files_struct *           fl_owner     Owner's files_struct
unsigned int                    fl_pid       PID of the process owner
wait_queue_head_t               fl_wait      Wait queue of blocked processes
struct file *                   fl_file      Pointer to file object
unsigned char                   fl_flags     Lock flags
unsigned char                   fl_type      Lock type
loff_t                          fl_start     Starting offset of locked region
loff_t                          fl_end       Ending offset of locked region
void (*)(struct file_lock *)    fl_notify    Function to call when lock is unblocked
void (*)(struct file_lock *)    fl_insert    Function to call when lock is inserted
void (*)(struct file_lock *)    fl_remove    Function to call when lock is removed
struct fasync_struct *          fl_fasync    Used for lease break notifications
union                           u            Filesystem-specific information

All file_lock structures that refer to the same file on disk are collected in a singly linked list, whose first element is pointed to by the i_flock field of the inode object. The fl_next field of the file_lock structure specifies the next element in the list.

When a process tries to get an advisory or mandatory lock, it may be suspended until the previously allocated lock on the same file region is released. All processes sleeping on some lock are inserted into a wait queue, whose head is stored in the fl_wait field of the file_lock structure. Moreover, all processes sleeping on any file locks are inserted into a circular doubly linked list, whose head (first dummy element) is stored in the blocked_list variable; the fl_block field of the file_lock data structure stores the pointers to the adjacent elements in the list.
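The per-inode list reached through i_flock and chained by fl_next can be modeled in a few lines of user-space C. This is a deliberately reduced sketch: the structures file_lock_model and inode_model keep only a handful of the fields of Table 12-19, the two functions are hypothetical helpers of ours, and inserting at the head of the list is a simplification that ignores whatever ordering the kernel maintains.

```c
#include <stddef.h>

/* Reduced model of file_lock: only the fields needed for the inode list. */
struct file_lock_model {
    struct file_lock_model *fl_next;    /* next element in inode list */
    unsigned int fl_pid;                /* PID of the process owner */
    unsigned char fl_type;              /* lock type */
    long long fl_start, fl_end;         /* locked region */
};

/* Reduced model of the inode: just the head of its lock list. */
struct inode_model {
    struct file_lock_model *i_flock;
};

/* Hypothetical helper: link a lock at the head of the inode list. */
void inode_add_lock(struct inode_model *inode, struct file_lock_model *fl)
{
    fl->fl_next = inode->i_flock;
    inode->i_flock = fl;
}

/* Hypothetical helper: count the locks on this inode owned by `pid`. */
int inode_count_locks(const struct inode_model *inode, unsigned int pid)
{
    const struct file_lock_model *fl;
    int n = 0;

    for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next)
        if (fl->fl_pid == pid)
            n++;
    return n;
}
```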

12.7.3 FL_FLOCK Locks

An FL_FLOCK lock is always associated with a file object and is thus maintained by a particular process (or clone processes sharing the same opened file). When a lock is requested and granted, the kernel replaces any other lock that the process is holding on the same file object. This happens only when a process wants to change an already owned read lock into a write one, or vice versa. Moreover, when a file object is being freed by the fput( ) function, all FL_FLOCK locks that refer to the file object are destroyed. However, there could be other FL_FLOCK read locks set by other processes for the same file (inode), and they still remain active.

The flock( ) system call acts on two parameters: the fd file descriptor of the file to be acted upon and a cmd parameter that specifies the lock operation. A cmd parameter of LOCK_SH requires a shared lock for reading, LOCK_EX requires an exclusive lock for writing, and LOCK_UN releases the lock. If the LOCK_NB value is ORed to the LOCK_SH or LOCK_EX operation, the system call does not block; in other words, if the lock cannot be immediately obtained, the system call returns an error code. Note that it is not possible to specify a region inside the file; the lock always applies to the whole file.

When the sys_flock( ) service routine is invoked, it performs the following steps:

1. Checks whether fd is a valid file descriptor; if not, returns an error code. Gets the address of the corresponding file object.

2. If the process has to acquire an advisory lock, checks that the process has both read and write permission on the open file; if not, returns an error code.

3. Invokes flock_lock_file( ), passing as parameters the file object pointer filp, the type type of lock operation required, and a flag wait. This last parameter is set if the system call should block (LOCK_NB clear) and cleared otherwise (LOCK_NB set). This function performs, in turn, the following actions:

   a. If the lock must be acquired, gets a new file_lock object and fills it with the appropriate lock operation.

   b. Searches the list that filp->f_dentry->d_inode->i_flock points to. If an FL_FLOCK lock for the same file object is found and an unlock operation is required, removes the file_lock element from the inode list and the global list, wakes up all processes sleeping in the lock's wait queue, frees the file_lock structure, and returns.

   c. Otherwise, searches the inode list again to verify that no existing FL_FLOCK lock conflicts with the requested one. There must be no FL_FLOCK write lock in the inode list, and moreover, there must be no FL_FLOCK lock at all if the process is requesting a write lock. However, a process may want to change the type of lock it already owns; this is done by issuing a second flock( ) system call. Therefore, the kernel always allows the process to change locks that refer to the same file object. If a conflicting lock is found and the LOCK_NB flag was specified, the function returns an error code; otherwise, it inserts the current process in the circular list of blocked processes and suspends it.

   d. If no incompatibility exists, inserts the file_lock structure into the global lock list and the inode list, and then returns 0 (success).

4. Returns the return code of flock_lock_file( ).
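Because an FL_FLOCK lock belongs to a file object rather than to a process, two independent open( ) calls on the same file compete with each other even inside one process. The following user-space sketch shows this; the function name second_flock_is_refused( ) and the temporary path passed to it are our own assumptions, and the behavior shown is the one expected on a local Linux filesystem.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/* Hypothetical demo: open `path` twice, take an exclusive flock() lock
 * through the first descriptor, then try a nonblocking exclusive lock
 * through the second. Returns 1 if the second request is refused with
 * EWOULDBLOCK, 0 if it is granted, -1 on any other failure.
 * The file is created if needed and removed before returning. */
int second_flock_is_refused(const char *path)
{
    int fd1 = open(path, O_RDWR | O_CREAT, 0600);
    int fd2 = open(path, O_RDWR);
    int refused = -1;

    if (fd1 >= 0 && fd2 >= 0 && flock(fd1, LOCK_EX) == 0) {
        if (flock(fd2, LOCK_EX | LOCK_NB) == 0)
            refused = 0;                /* unexpected: no conflict seen */
        else if (errno == EWOULDBLOCK)
            refused = 1;                /* conflicting lock, LOCK_NB set */
    }

    if (fd1 >= 0)
        close(fd1);                     /* freeing the file object drops its lock */
    if (fd2 >= 0)
        close(fd2);
    unlink(path);
    return refused;
}
```

Without LOCK_NB the second call would block forever in this single-process demo, which is why the nonblocking form is used.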

12.7.4 FL_POSIX Locks

An FL_POSIX lock is always associated with a process and with an inode; the lock is automatically released either when the process dies or when a file descriptor is closed (even if the process opened the same file twice or duplicated a file descriptor). Moreover, FL_POSIX locks are never inherited by the child across a fork( ).

When used to lock files, the fcntl( ) system call acts on three parameters: the fd file descriptor of the file to be acted upon, a cmd parameter that specifies the lock operation, and an fl pointer to a flock data structure. Version 2.4 of Linux also defines a flock64 structure, which uses 64-bit fields for the file offset and length fields. In the following, we focus on the flock data structure, but the description is valid for flock64 too.
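From the application side, a lock request through fcntl( ) might be issued as in the sketch below. The wrapper lock_region( ) is a hypothetical helper of ours; the fields of the flock structure that it fills in are the standard ones, and their meaning is described in the text that follows.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical wrapper: apply a lock of the given type (F_RDLCK,
 * F_WRLCK, or F_UNLCK) to `len` bytes starting at offset `start`,
 * measured from the beginning of the file. A `len` of 0 extends the
 * region to the end of the file and beyond. Nonblocking (F_SETLK).
 * Returns 0 on success, -1 on failure with errno set by fcntl(). */
int lock_region(int fd, int type, off_t start, off_t len)
{
    struct flock fl = { 0 };

    fl.l_type = (short)type;
    fl.l_whence = SEEK_SET;     /* l_start is relative to the file start */
    fl.l_start = start;
    fl.l_len = len;

    return fcntl(fd, F_SETLK, &fl);
}
```

Replacing F_SETLK with F_SETLKW would make the call sleep until the region becomes available instead of failing immediately.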

Locks of type FL_POSIX are able to protect an arbitrary file region, even a single byte. The region is specified by three fields of the flock structure. l_start is the initial offset of the region and is relative to the beginning of the file (if field l_whence is set to SEEK_SET), to the current file pointer (if l_whence is set to SEEK_CUR), or to the end of the file (if l_whence is set to SEEK_END). The l_len field specifies the length of the file region (or 0, which means that the region includes all potential writes past the current end of the file).

The sys_fcntl( ) service routine behaves differently, depending on the value of the flag set in the cmd parameter:

F_GETLK
    Determines whether the lock described by the flock structure conflicts with some FL_POSIX lock already obtained by another process. In this case, the flock structure is overwritten with the information about the existing lock.

F_SETLK
    Sets the lock described by the flock structure. If the lock cannot be acquired, the system call returns an error code.

F_SETLKW
    Sets the lock described by the flock structure. If the lock cannot be acquired, the system call blocks; that is, the calling process is put to sleep.

F_GETLK64, F_SETLK64, F_SETLKW64
    Identical to the previous ones, but the flock64 data structure is used rather than flock.

When sys_fcntl( ) acquires a lock, it performs the following:

1. Reads the flock structure from user space.

2. Gets the file object corresponding to fd.

3. Checks whether the lock should be a mandatory one and the file has a shared memory mapping (see Chapter 15). In this case, refuses to create the lock and returns the -EAGAIN error code; the file is already being accessed by another process.

4. Initializes a new file_lock structure according to the contents of the user's flock structure.

5. Terminates returning an error code if the file does not allow the access mode specified by the type of the requested lock.

6. Invokes the lock method of the file operations, if defined.

7. Invokes the posix_lock_file( ) function, which executes the following actions:

   a. Invokes posix_locks_conflict( ) for each FL_POSIX lock in the inode's lock list. The function checks whether the lock conflicts with the requested one. Essentially, there must be no FL_POSIX write lock for the same region in the inode list, and there may be no FL_POSIX lock at all for the same region if the process is requesting a write lock. However, locks owned by the same process never conflict; this allows a process to change the characteristics of a lock it already owns.

   b. If a conflicting lock is found and fcntl( ) was invoked with the F_SETLK or F_SETLK64 flag, returns an error code. Otherwise, the current process should be suspended. In this case, invokes posix_locks_deadlock( ) to check that no deadlock condition is being created among processes waiting for FL_POSIX locks, and then inserts the current process in the circular list of blocked processes and suspends it.

   c. As soon as the inode's lock list includes no conflicting lock, checks all the FL_POSIX locks of the current process that overlap the file region that the current process wants to lock, and combines and splits adjacent areas as required. For example, if the process requested a write lock for a file region that falls inside a read-locked wider region, the previous read lock is split into two parts covering the nonoverlapping areas, while the central region is protected by the new write lock. In case of overlaps, newer locks always replace older ones.

   d. Inserts the new file_lock structure in the global lock list and in the inode list.

8. Returns the value 0 (success).
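The conflict rule applied to each FL_POSIX lock in the inode list can be restated as a small predicate. The sketch below is a user-space model written from the description above, not the kernel's source: the structure posix_lock_model and both function names are our own, and the owner is reduced to a plain integer.

```c
/* Reduced model of an FL_POSIX lock; `end` is inclusive. */
struct posix_lock_model {
    int owner;          /* stands in for the owning process */
    int is_write;       /* 1 for a write lock, 0 for a read lock */
    long long start;
    long long end;
};

/* Two inclusive regions overlap if neither ends before the other starts. */
static int regions_overlap(const struct posix_lock_model *a,
                           const struct posix_lock_model *b)
{
    return a->start <= b->end && b->start <= a->end;
}

/* Returns 1 if the held lock prevents the requested one, 0 otherwise. */
int posix_locks_conflict_model(const struct posix_lock_model *held,
                               const struct posix_lock_model *req)
{
    if (held->owner == req->owner)
        return 0;                       /* same owner: never a conflict */
    if (!regions_overlap(held, req))
        return 0;                       /* disjoint regions */
    return held->is_write || req->is_write;   /* two read locks coexist */
}
```

The same-owner test comes first because it is what lets a process later replace, combine, or split its own locks on an overlapping region.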


Chapter 13. Managing I/O Devices

The Virtual File System in the last chapter depends on lower-level functions to carry out each read, write, or other operation in a manner suited to each device. The previous chapter included a brief discussion of how operations are handled by different filesystems. In this chapter, we look at how the kernel invokes the operations on actual devices.

In Section 13.1, we give a brief survey of the 80x86 I/O architecture. In Section 13.2, we show how the VFS associates a special file called "device file" with each different hardware device so that application programs can use all kinds of devices in the same way. Finally, in Section 13.3, we illustrate the overall organization of device drivers in Linux. Readers interested in developing device drivers on their own may want to refer to Alessandro Rubini and Jonathan Corbet's Linux Device Drivers (O'Reilly).


13.1 I/O Architecture

To make a computer work properly, data paths must be provided that let information flow between CPU(s), RAM, and the score of I/O devices that can be connected to a personal computer. These data paths, which are denoted collectively as the bus, act as the primary communication channel inside the computer.

Several types of buses, such as the ISA, EISA, PCI, and MCA, are currently in use. In this section, we discuss the functional characteristics common to all PC architectures, without giving details about a specific bus type.

In fact, what is commonly denoted as a bus consists of three specialized buses:

Data bus
    A group of lines that transfers data in parallel. The Pentium has a 64-bit-wide data bus.

Address bus
    A group of lines that transmits an address in parallel. The Pentium has a 32-bit-wide address bus.

Control bus
    A group of lines that transmits control information to the connected circuits. The Pentium uses control lines to specify, for instance, whether the bus is used to allow data transfers between a processor and the RAM, or alternatively, between a processor and an I/O device. Control lines also determine whether a read or a write transfer must be performed.

When the bus connects the CPU to an I/O device, it is called an I/O bus. In this case, 80x86 microprocessors use 16 out of the 32 address lines to address I/O devices and 8, 16, or 32 out of the 64 data lines to transfer data. The I/O bus, in turn, is connected to each I/O device by means of a hierarchy of hardware components including up to three elements: I/O ports, interfaces, and device controllers. Figure 13-1 shows the components of the I/O architecture.

Figure 13-1. PC's I/O architecture

13.1.1 I/O Ports

Each device connected to the I/O bus has its own set of I/O addresses, which are usually called I/O ports. In the IBM PC architecture, the I/O address space provides up to 65,536 8-bit I/O ports. Two consecutive 8-bit ports may be regarded as a single 16-bit port, which must start on an even address. Similarly, two consecutive 16-bit ports may be regarded as a single 32-bit port, which must start on an address that is a multiple of 4. Four special assembly language instructions called in, ins, out, and outs allow the CPU to read from and write into an I/O port. While executing one of these instructions, the CPU uses the address bus to select the required I/O port and the data bus to transfer data between a CPU register and the port.

I/O ports may also be mapped into addresses of the physical address space. The processor is then able to communicate with an I/O device by issuing assembly language instructions that operate directly on memory (for instance, mov, and, or, and so on). Modern hardware devices are more suited to mapped I/O, since it is faster and can be combined with DMA.

An important objective for system designers is to offer a unified approach to I/O programming without sacrificing performance. Toward that end, the I/O ports of each device are structured into a set of specialized registers, as shown in Figure 13-2. The CPU writes the commands to be sent to the device into the control register and reads a value that represents the internal state of the device from the status register. The CPU also fetches data from the device by reading bytes from the input register and pushes data to the device by writing bytes into the output register.

Figure 13-2. Specialized I/O ports

To lower costs, the same I/O port is often used for different purposes. For instance, some bits describe the device state, while others specify the command to be issued to the device. Similarly, the same I/O port may be used as an input register or an output register.
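The alignment rule for 16-bit and 32-bit ports stated in this section, together with the 65,536-port limit of the I/O address space, can be checked with a few lines of C. The function io_port_is_valid( ) below is a hypothetical helper of ours, not a kernel routine.

```c
/* Hypothetical helper: nonzero if an access of `width_bits` (8, 16,
 * or 32) at I/O address `port` is properly aligned and lies entirely
 * within the 65,536-port I/O address space. */
int io_port_is_valid(unsigned int port, unsigned int width_bits)
{
    unsigned int bytes;

    if (width_bits != 8 && width_bits != 16 && width_bits != 32)
        return 0;
    bytes = width_bits / 8;

    if (port % bytes != 0)              /* even address, or multiple of 4 */
        return 0;
    return port <= 0x10000u - bytes;    /* last byte must not exceed 0xFFFF */
}
```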

13.1.1.1 Accessing I/O ports

The in, out, ins, and outs assembly language instructions access I/O ports. The following auxiliary functions are included in the kernel to simplify such accesses:

inb( ), inw( ), inl( )
    Read 1, 2, or 4 consecutive bytes, respectively, from an I/O port. The suffix "b," "w," or "l" refers, respectively, to a byte (8 bits), a word (16 bits), and a long (32 bits).

inb_p( ), inw_p( ), inl_p( )
    Read 1, 2, or 4 consecutive bytes, respectively, from an I/O port, and then execute a "dummy" instruction to introduce a pause.

outb( ), outw( ), outl( )
    Write 1, 2, or 4 consecutive bytes, respectively, to an I/O port.

outb_p( ), outw_p( ), outl_p( )
    Write 1, 2, or 4 consecutive bytes, respectively, to an I/O port, and then execute a "dummy" instruction to introduce a pause.

insb( ), insw( ), insl( )
    Read sequences of consecutive bytes, in groups of 1, 2, or 4, respectively, from an I/O port. The length of the sequence is specified as a parameter of the functions.

outsb( ), outsw( ), outsl( )
    Write sequences of consecutive bytes, in groups of 1, 2, or 4, respectively, to an I/O port.

While accessing I/O ports is simple, detecting which I/O ports have been assigned to I/O devices may not be easy, in particular, for systems based on an ISA bus. Often a device driver must blindly write into some I/O port to probe the hardware device; if, however, this I/O port is already used by some other hardware device, a system crash could occur. To prevent such situations, the kernel keeps track of I/O ports assigned to each hardware device by means of "resources."

A resource represents a portion of some entity that can be exclusively assigned to a device driver. In our case, a resource represents a range of I/O port addresses. The information relative to each resource is stored in a resource data structure, whose fields are shown in Table 13-1. All resources of the same kind are inserted in a tree-like data structure; for instance, all resources representing I/O port address ranges are included in a tree rooted at the node ioport_resource.

Table 13-1. The fields of the resource data structure

Type                Field     Description
------------------  --------  --------------------------------------------
const char *        name      Description of owner of the resource
unsigned long       start     Start of the resource range
unsigned long       end       End of the resource range
unsigned long       flags     Various flags
struct resource *   parent    Pointer to parent in the resource tree
struct resource *   sibling   Pointer to a sibling in the resource tree
struct resource *   child     Pointer to first child in the resource tree

The children of a node are collected in a list whose first element is pointed to by the child field. The sibling field points to the next node in the list.

Why use a tree? Well, consider, for instance, the I/O port addresses used by an IDE hard disk interface (let's say from 0xf000 to 0xf00f). A resource with the start field set to 0xf000 and the end field set to 0xf00f is then included in the tree, and the conventional name of the controller is stored in the name field. However, the IDE device driver needs to remember another bit of information, namely that the subrange from 0xf000 to 0xf007 is used for the master disk of the IDE chain, while the subrange from 0xf008 to 0xf00f is used for the slave disk. To do this, the device driver inserts two children below the resource corresponding to the whole range from 0xf000 to 0xf00f, one child for each subrange of I/O ports. As a general rule, each node of the tree must correspond to a subrange of the range associated with the parent.
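The tree rule can be sketched with a simplified user-space model. The structure below copies the fields of Table 13-1, but the function `model_request_resource` and its return convention (0 on success, -1 on conflict) are assumptions made for this illustration; this is not the kernel's own code.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified user-space model of the fields listed in Table 13-1. */
struct resource {
    const char *name;
    unsigned long start, end;
    unsigned long flags;
    struct resource *parent, *sibling, *child;
};

/*
 * Insert `new` as a child of `root`, keeping the children sorted by
 * address. Returns 0 on success, or -1 if `new` is not a subrange of
 * `root` or overlaps a child that is already present.
 */
static int model_request_resource(struct resource *root, struct resource *new)
{
    struct resource **p;

    assert(root != NULL && new != NULL);
    if (new->end < new->start)
        return -1;
    if (new->start < root->start || new->end > root->end)
        return -1;              /* not a subrange of the parent */

    /* Walk the sibling list until the slot for `new` is found. */
    for (p = &root->child; *p != NULL; p = &(*p)->sibling) {
        if ((*p)->start > new->end)
            break;              /* `new` fits entirely before *p */
        if ((*p)->end >= new->start)
            return -1;          /* the two ranges overlap */
    }
    new->sibling = *p;
    new->parent = root;
    *p = new;
    return 0;
}
```

With this model, the IDE example corresponds to inserting a 0xf000 to 0xf00f node below the root and then two children, 0xf000 to 0xf007 and 0xf008 to 0xf00f, below that node; a later request that straddles the two subranges is rejected.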

Any device driver may use the following three functions, passing to them the root node of the resource tree and the address of a resource data structure of interest:

request_resource( )
    Assigns a given range to an I/O device.

check_resource( )
    Checks whether a given range is free or whether some subrange has already been assigned to an I/O device.

release_resource( )
    Releases a given range previously assigned to an I/O device.

The kernel also defines some shortcuts to the above functions that apply to I/O ports: request_region( ) assigns a given interval of I/O ports, check_region( ) verifies whether a given interval of I/O ports is free or (even partially) busy, and release_region( ) releases a previously assigned interval of I/O ports. The tree of all I/O addresses currently assigned to I/O devices can be obtained from the /proc/ioports file.

13.1.2 I/O Interfaces

An I/O interface is a hardware circuit inserted between a group of I/O ports and the corresponding device controller. It acts as an interpreter that translates the values in the I/O ports into commands and data for the device. In the opposite direction, it detects changes in the device state and correspondingly updates the I/O port that plays the role of status register. This circuit can also be connected through an IRQ line to a Programmable Interrupt Controller, so that it issues interrupt requests on behalf of the device.

There are two types of interfaces:

Custom I/O interfaces
    Devoted to one specific hardware device. In some cases, the device controller is located in the same card [1] that contains the I/O interface. The devices attached to a custom I/O interface can be either internal devices (devices located inside the PC's cabinet) or external devices (devices located outside the PC's cabinet).

General-purpose I/O interfaces
    Used to connect several different hardware devices. Devices attached to a general-purpose I/O interface are always external devices.

[1] Each card must be inserted in one of the available free bus slots of the PC. If the card can be connected to an external device through an external cable, the card sports a suitable connector in the rear panel of the PC.

13.1.2.1 Custom I/O interfaces

Just to give an idea of how much variety is encompassed by custom I/O interfaces (and thus by the devices currently installed in a PC), we'll list some of the most commonly found:

Keyboard interface
    Connected to a keyboard controller that includes a dedicated microprocessor. This microprocessor decodes the combination of pressed keys, generates an interrupt, and puts the corresponding scan code in an input register.

Graphic interface
    Packed together with the corresponding controller in a graphic card that has its own frame buffer, as well as a specialized processor and some code stored in a Read-Only Memory chip (ROM). The frame buffer is an on-board memory containing a description of the current screen contents.

Disk interface
    Connected by a cable to the disk controller, which is usually integrated with the disk. For instance, the IDE interface is connected by a 40-wire flat conductor cable to an intelligent disk controller that can be found on the disk itself.

Bus mouse interface
    Connected by a cable to the corresponding controller, which is included in the mouse.

Network interface
    Packed together with the corresponding controller in a network card used to receive or transmit network packets. Although there are several widely adopted network standards, Ethernet (IEEE 802.3) is the most common.

13.1.2.2 General-purpose I/O interfaces

Modern PCs include several general-purpose I/O interfaces, which connect a wide range of external devices. The most common interfaces are:

Parallel port
    Traditionally used to connect printers, it can also be used to connect removable disks, scanners, backup units, and other computers. The data is transferred 1 byte (8 bits) at a time.

Serial port
    Like the parallel port, but the data is transferred 1 bit at a time. It includes a Universal Asynchronous Receiver and Transmitter (UART) chip to string out the bytes to be sent into a sequence of bits and to reassemble the received bits into bytes. Since it is intrinsically slower than the parallel port, this interface is mainly used to connect external devices that do not operate at a high speed, like modems, mice, and printers.

Universal serial bus (USB)
    A recent general-purpose I/O interface that is quickly gaining popularity. It operates at a high speed, and may be used for the external devices traditionally connected to the parallel port and the serial port.

PCMCIA interface
    Included mostly on portable computers. The external device, which has the shape of a credit card, can be inserted into and removed from a slot without rebooting the system. The most common PCMCIA devices are hard disks, modems, network cards, and RAM expansions.

SCSI (Small Computer System Interface) interface
    A circuit that connects the main PC bus to a secondary bus called the SCSI bus. The SCSI-2 bus allows up to eight PCs and external devices (hard disks, scanners, CD-ROM writers, and so on) to be connected. Wide SCSI-2 and the recent SCSI-3 interfaces allow you to connect 16 devices or more if additional interfaces are present. The SCSI standard is the communication protocol used to connect devices via the SCSI bus.

13.1.3 Device Controllers

A complex device may require a device controller to drive it. Essentially, the controller plays two important roles:

● It interprets the high-level commands received from the I/O interface and forces the device to execute specific actions by sending proper sequences of electrical signals to it.
● It converts and properly interprets the electrical signals received from the device and modifies (through the I/O interface) the value of the status register.

A typical device controller is the disk controller, which receives high-level commands such as "write this block of data" from the microprocessor (through the I/O interface) and converts them into low-level disk operations such as "position the disk head on the right track" and "write the data inside the track." Modern disk controllers are very sophisticated, since they can keep the disk data in fast memory caches and can reorder the CPU's high-level requests to optimize them for the actual disk geometry.

Simpler devices do not have a device controller; examples include the Programmable Interrupt Controller (see Section 4.2) and the Programmable Interval Timer (see Section 6.1.3).

Several hardware devices include their own memory, which is often called I/O shared memory. For instance, all recent graphic cards include a few megabytes of RAM in the frame buffer, which is used to store the screen image to be displayed on the monitor.

13.1.3.1 Mapping addresses of I/O shared memory

Depending on the device and on the bus type, I/O shared memory in the PC's architecture may be mapped within three different physical address ranges:

For most devices connected to the ISA bus
    The I/O shared memory is usually mapped into the physical addresses ranging from 0xa0000 to 0xfffff; this gives rise to the "hole" between 640 KB and 1 MB mentioned in Section 2.5.3.

For some old devices using the VESA Local Bus (VLB)
    This is a specialized bus mainly used by graphic cards: the I/O shared memory is mapped into the physical addresses ranging from 0xe00000 to 0xffffff, that is, between 14 MB and 16 MB. These devices, which further complicate the initialization of the paging tables, are going out of production.

For devices connected to the PCI bus
    The I/O shared memory is mapped into very large physical addresses, well above the end of RAM's physical addresses. This kind of device is much simpler to handle.

Recently, Intel introduced the Accelerated Graphics Port (AGP) standard, which is an enhancement of PCI for high-performance graphic cards. Besides having its own I/O shared memory, this kind of card is capable of directly addressing portions of the motherboard's RAM by means of a special hardware circuit named Graphics Address Remapping Table (GART). The GART circuitry enables AGP cards to sustain much higher data transfer rates than older PCI cards. From the kernel's point of view, however, it doesn't really matter where the physical memory is located, and GART-mapped memory is handled like the other kinds of I/O shared memory.

13.1.3.2 Accessing the I/O shared memory

How does the kernel access an I/O shared memory location? Let's start with the PC's architecture, which is relatively simple to handle, and then extend the discussion to other architectures.

Remember that kernel programs act on linear addresses, so the I/O shared memory locations must be expressed as addresses greater than PAGE_OFFSET. In the following discussion, we assume that PAGE_OFFSET is equal to 0xc0000000, that is, that the kernel linear addresses are in the fourth gigabyte.

Kernel drivers must translate I/O physical addresses of I/O shared memory locations into linear addresses in kernel space. In the PC architecture, this can be achieved simply by ORing the 32-bit physical address with the 0xc0000000 constant. For instance, suppose the kernel needs to store the value in the I/O location at physical address 0x000b0fe4 in t1 and the value in the I/O location at physical address 0xfc000000 in t2. One might think that the following statements could do the job:

    t1 = *((unsigned char *)(0xc00b0fe4));
    t2 = *((unsigned char *)(0xfc000000));

During the initialization phase, the kernel maps the available RAM's physical addresses into the initial portion of the fourth gigabyte of the linear address space. Therefore, the Paging Unit maps the 0xc00b0fe4 linear address appearing in the first statement back to the original I/O physical address 0x000b0fe4, which falls inside the "ISA hole" between 640 KB and 1 MB (see Section 2.5). This works fine.

There is a problem, however, for the second statement because the I/O physical address is greater than the last physical address of the system RAM. Therefore, the 0xfc000000 linear address does not necessarily correspond to the 0xfc000000 physical address. In such cases, the kernel Page Tables must be modified to include a linear address that maps the I/O physical address. This can be done by invoking the ioremap( ) or ioremap_nocache( ) functions. These functions, which are similar to vmalloc( ), invoke get_vm_area( ) to create a new vm_struct descriptor (see Section 7.3.2) for a linear address interval that has the size of the required I/O shared memory area. The functions then update the corresponding Page Table entries of the canonical kernel Page Tables appropriately. ioremap_nocache( ) differs from ioremap( ) in that it also disables the hardware cache for references to the remapped linear addresses.

The correct form for the second statement might therefore look like:

    io_mem = ioremap(0xfb000000, 0x200000);
    t2 = *((unsigned char *)(io_mem + 0x100000));

The first statement creates a new 2 MB linear address interval, which maps physical addresses starting from 0xfb000000; the second one reads the memory location that has the 0xfc000000 address. To remove the mapping later, the device driver must use the iounmap( ) function.

On some architectures other than the PC, I/O shared memory cannot be accessed by simply dereferencing the linear address pointing to the physical memory location. Therefore, Linux defines the following architecture-dependent macros, which should be used when accessing I/O shared memory:

readb, readw, readl
    Reads 1, 2, or 4 bytes, respectively, from an I/O shared memory location

writeb, writew, writel
    Writes 1, 2, or 4 bytes, respectively, into an I/O shared memory location

memcpy_fromio, memcpy_toio
    Copies a block of data from an I/O shared memory location to dynamic memory and vice versa

memset_io
    Fills an I/O shared memory area with a fixed value

The recommended way to access the 0xfc000000 I/O location is thus:

    io_mem = ioremap(0xfb000000, 0x200000);
    t2 = readb(io_mem + 0x100000);

Thanks to these macros, all dependencies on platform-specific ways of accessing the I/O shared memory can be hidden.

13.1.4 Direct Memory Access (DMA)

All PCs include an auxiliary processor called the Direct Memory Access Controller (DMAC), which can be instructed to transfer data between the RAM and an I/O device. Once activated by the CPU, the DMAC is able to continue the data transfer on its own; when the data transfer is completed, the DMAC issues an interrupt request. The conflicts that occur when both CPU and DMAC need to access the same memory location at the same time are resolved by a hardware circuit called a memory arbiter (see Section 5.3.1).

The DMAC is mostly used by disk drivers and other slow devices that transfer a large number of bytes at once. Because setup time for the DMAC is relatively high, it is more efficient to directly use the CPU for the data transfer when the number of bytes is small.

The first DMACs for the old ISA buses were complex, hard to program, and limited to the lower 16 MB of physical memory. More recent DMACs for the PCI and SCSI buses rely on dedicated hardware circuits in the buses and make life easier for device driver developers.

Until now we have distinguished three kinds of memory addresses: logical and linear addresses, which are used internally by the CPU, and physical addresses, which are the memory addresses used by the CPU to physically drive the data bus. However, there is a fourth kind of memory address: the so-called bus address. It corresponds to the memory addresses used by all hardware devices except the CPU to drive the data bus. In the PC architecture, bus addresses coincide with physical addresses; however, in other architectures (like Sun's SPARC and Hewlett-Packard's Alpha), these two kinds of addresses differ.

Why should the kernel be concerned at all about bus addresses? Well, in a DMA operation, the data transfer takes place without CPU intervention; the data bus is driven directly by the I/O device and the DMAC. Therefore, when the kernel sets up a DMA operation, it must write the bus address of the memory buffer involved in the proper I/O ports of the DMAC or I/O device.

13.1.4.1 Putting DMA to work

Several I/O drivers use the Direct Memory Access Controller (DMAC) to speed up operations. The DMAC interacts with the device's I/O controller to perform a data transfer, and the kernel includes an easy-to-use set of routines to program the DMAC. The I/O controller signals to the CPU, via an IRQ, when the data transfer has finished.

When a device driver sets up a DMA operation for some I/O device, it must specify the memory buffer involved by using bus addresses. The kernel provides the virt_to_bus and bus_to_virt macros, respectively, to translate a linear address into a bus address and vice versa.

As with IRQ lines, the DMAC is a resource that must be assigned dynamically to the drivers that need it. The way the driver starts and ends DMA operations depends on the type of bus. For recent buses, such as PCI or SCSI, there are two main steps to perform: allocating an IRQ line and triggering the DMA transfer.

The IRQ line used for signaling the termination of the DMA operation is allocated when opening the device file (see Section 13.3.4 later in this chapter). To start a DMA operation, the device driver simply writes the bus address of the DMA buffer, the transfer direction, and the size of the data in an I/O port of the hardware device; the driver then suspends the current process. When the DMA transfer ends, the hardware device raises an interrupt that wakes the device driver. The release method of the device file releases the IRQ line when the file object is closed by the last process.
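On the PC, where bus addresses coincide with physical addresses, the translation between linear and bus addresses for directly mapped RAM reduces to adding or subtracting PAGE_OFFSET. The sketch below models that arithmetic in user space, together with the 16 MB limit of the old ISA DMACs. The names `pc_virt_to_bus`, `pc_bus_to_virt`, and `isa_dma_reachable` are hypothetical stand-ins, not the kernel's macros, and the 0xc0000000 value of PAGE_OFFSET is the one assumed earlier in this chapter.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_OFFSET   0xc0000000u   /* start of the kernel linear addresses */
#define ISA_DMA_LIMIT 0x01000000u   /* 16 MB: reach of the old ISA DMACs    */

/* PC only: bus address == physical address == linear - PAGE_OFFSET. */
static uint32_t pc_virt_to_bus(uint32_t linear)
{
    assert(linear >= PAGE_OFFSET);
    return linear - PAGE_OFFSET;
}

/* Inverse translation, valid for directly mapped RAM only. */
static uint32_t pc_bus_to_virt(uint32_t bus)
{
    assert(bus <= 0xffffffffu - PAGE_OFFSET);
    return bus + PAGE_OFFSET;
}

/* Nonzero if a buffer of `len` bytes starting at the linear address
 * `linear` lies entirely below the 16 MB ISA DMA limit. */
static int isa_dma_reachable(uint32_t linear, uint32_t len)
{
    uint32_t bus = pc_virt_to_bus(linear);

    return len <= ISA_DMA_LIMIT && bus <= ISA_DMA_LIMIT - len;
}
```

On architectures where bus and physical addresses differ, the translation is not a simple offset, which is why drivers are expected to go through the kernel's macros rather than hardcode arithmetic like this.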


13.2 Device Files

As mentioned in Chapter 1, Unix-like operating systems are based on the notion of a file, which is just an information container structured as a sequence of characters. According to this approach, I/O devices are treated as files; thus, the same system calls used to interact with regular files on disk can be used to directly interact with I/O devices. For example, the same write( ) system call may be used to write data into a regular file or to send it to a printer by writing to the /dev/lp0 device file.

According to the characteristics of the underlying device drivers, device files can be of two types: block or character. The difference between the two classes of hardware devices is not so clear cut. At least we can assume the following:

● The data of a block device can be addressed randomly, and the time needed to transfer any data block is small and roughly the same, at least from the point of view of the human user. Typical examples of block devices are hard disks, floppy disks, CD-ROM drives, and DVD players.
● The data of a character device either cannot be addressed randomly (consider, for instance, a sound card), or they can be addressed randomly, but the time required to access a random datum largely depends on its position inside the device (consider, for instance, a magnetic tape drive).

Network cards are a remarkable exception to this scheme, since they are hardware devices that are not directly associated with files; we describe them in Chapter 18.

In Linux 2.4, there are two different kinds of device files: old-style device files, which are real files stored in the system's directory tree, and devfs device files, which are virtual files like those of the /proc filesystem. Let's now discuss both types of device files in more detail.

13.2.1 Old-Style Device Files

Old-style device files have been in use since the early versions of the Unix operating system. An old-style device file is a real file stored in a filesystem. Its inode, however, doesn't address blocks of data on the disk. Instead, the inode includes an identifier of a hardware device. Besides its name and type (either character or block, as already mentioned), each device file has two main attributes:

Major number
    A number ranging from 1 to 254 that identifies the device type. Usually, all device files that have the same major number and the same type share the same set of file operations, since they are handled by the same device driver.

Minor number
    A number that identifies a specific device among a group of devices that share the same major number.

The mknod( ) system call is used to create old-style device files. It receives the name of the device file, its type, and the major and minor numbers as parameters. The last two parameters are merged in a 16-bit dev_t number; the eight most significant bits identify the major number, while the remaining ones identify the minor number. The MAJOR and MINOR macros extract the two values from the 16-bit number, while the MKDEV macro merges a major and minor number into a 16-bit number. Actually, dev_t is the data type specifically used by application programs; the kernel uses the kdev_t data type. In Linux 2.4, both types reduce to an unsigned short integer, but kdev_t will become a complete device file descriptor in some future Linux version.

The major and minor numbers are stored in the i_rdev field of the inode object. The type of device file (character or block) is stored in the i_mode field.
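The 16-bit encoding just described (eight high bits for the major number, eight low bits for the minor number) can be modeled in a few lines of user-space C. The `MODEL_` prefix marks these macros as an illustration written for this section; they follow the layout described in the text and should not be read as a copy of the kernel's own header.

```c
#include <assert.h>

/* Model of the 16-bit device number described in the text:
 * high 8 bits = major number, low 8 bits = minor number. */
typedef unsigned short model_dev_t;

#define MODEL_MINORBITS 8
#define MODEL_MINORMASK ((1U << MODEL_MINORBITS) - 1)

#define MODEL_MAJOR(dev)    ((unsigned int)((dev) >> MODEL_MINORBITS))
#define MODEL_MINOR(dev)    ((unsigned int)((dev) & MODEL_MINORMASK))
#define MODEL_MKDEV(ma, mi) ((model_dev_t)(((ma) << MODEL_MINORBITS) | (mi)))

/* Checked wrapper: both numbers must fit in 8 bits. */
static model_dev_t model_make_dev(unsigned int major, unsigned int minor)
{
    assert(major <= 255 && minor <= 255);
    return MODEL_MKDEV(major, minor);
}
```

For instance, a block device with major number 3 and minor number 64 is encoded as 0x0340, and the two extraction macros recover 3 and 64 from that value.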

Device files are usually included in the /dev directory. Table 13-2 illustrates the attributes of some device files. [2] Notice that character and block devices have independent numbering, so block device (3,0) is distinct from character device (3,0).

[2] The official registry of allocated device numbers and /dev directory nodes is stored in the Documentation/devices.txt file. The major numbers of the devices supported may also be found in the include/linux/major.h file.

Table 13-2. Examples of device files

Name           Type    Major   Minor   Description
-------------  ------  ------  ------  --------------------------------------------
/dev/fd0       block   2       0       Floppy disk
/dev/hda       block   3       0       First IDE disk
/dev/hda2      block   3       2       Second primary partition of first IDE disk
/dev/hdb       block   3       64      Second IDE disk
/dev/hdb3      block   3       67      Third primary partition of second IDE disk
/dev/ttyp0     char    3       0       Terminal
/dev/console   char    5       1       Console
/dev/lp1       char    6       1       Parallel printer
/dev/ttyS0     char    4       64      First serial port
/dev/rtc       char    10      135     Real time clock
/dev/null      char    1       3       Null device (black hole)

Usually, a device file is associated with a hardware device (like a hard disk, for instance /dev/hda) or with some physical or logical portion of a hardware device (like a disk partition, for instance /dev/hda2). In some cases, however, a device file is not associated with any real hardware device, but represents a fictitious logical device. For instance, /dev/null is a device file corresponding to a "black hole"; all data written into it is simply discarded, and the file always appears empty.

As far as the kernel is concerned, the name of the device file is irrelevant. If you create a device file named /tmp/disk of type "block" with the major number 3 and minor number 0, it would be equivalent to the /dev/hda device file shown in the table. On the other hand, device file names may be significant for some application programs. For example, a communication program might assume that the first serial port is associated with the /dev/ttyS0 device file. But most application programs can be configured to interact with arbitrarily named device files.
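From User Mode, the two attributes that actually identify a device file (its type and its device number) can be read through the stat( ) system call, whatever the pathname is. The following is a minimal sketch, assuming a Linux system with glibc, which provides the major( ) and minor( ) macros; device_kind is a hypothetical helper name introduced only for illustration.

```c
#include <sys/stat.h>
#include <sys/sysmacros.h>   /* major(), minor() on Linux with glibc */
#include <sys/types.h>

/*
 * Report the kind of file found at path: 'b' for a block device file,
 * 'c' for a character device file, '-' for any other file, or 0 if
 * stat() fails. For device files, the major and minor numbers are
 * stored through maj_out and min_out.
 */
char device_kind(const char *path, unsigned int *maj_out, unsigned int *min_out)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return 0;
    if (!S_ISBLK(st.st_mode) && !S_ISCHR(st.st_mode))
        return '-';
    *maj_out = major(st.st_rdev);
    *min_out = minor(st.st_rdev);
    return S_ISBLK(st.st_mode) ? 'b' : 'c';
}
```

On a typical Linux system, calling device_kind on /dev/null reports a character device with major number 1 and minor number 3, in agreement with Table 13-2. A device file created under another name with the same type and numbers would report the same values.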

13.2.2 Devfs Device Files

Identifying I/O devices by means of major and minor numbers has some limitations:

1. Most of the devices present in a /dev directory don't exist; the device files have been included so that the system administrator doesn't need to create a device file before installing a new I/O driver. However, a typical /dev directory, which includes over 1,800 device files, increases the time taken to look up an inode when first referenced.

2. The major and minor numbers are 8 bits long. Nowadays, this is a limiting factor for several hardware devices. For instance, it poses problems when identifying SCSI devices included in very large systems (the Linux workaround consists of allocating several major numbers to the SCSI disk driver; as a result, the kernel supports up to 128 SCSI disks).

The devfs device files have been introduced to solve these problems and other minor issues. However, at the time of this writing they are still not widely adopted; thus, we limit ourselves to sketching the main ideas behind them without describing the code.

The devfs virtual filesystem allows drivers to register devices by name rather than by major and minor numbers. The kernel provides a default naming scheme designed to make it easy to search for specific devices. For example, all disk devices are placed under the /dev/discs virtual directory; /dev/hda might become /dev/discs/disc0, /dev/hdb might become /dev/discs/disc1, and so on. Users can still refer to the old name scheme by properly configuring a device management daemon.

I/O drivers that use the devfs filesystem register devices by invoking devfs_register( ). This function creates a new devfs_entry structure that includes the device file name and a pointer to a table of device driver methods. A registered device file automatically appears in a devfs virtual directory. The inode object of a device file in this directory is created only when the file is accessed. [3] Opening a device file is also slightly more efficient because dentry objects of devfs files include pointers to the proper file operations (see Section 13.3.4 later in this chapter).

[3] The devfs filesystem is a virtual filesystem, similar to the /proc filesystem. It does not manage disk space: inode objects are created in RAM when needed and do not have a corresponding disk inode.

There are, however, some problems with devfs. The most important one is that major and minor numbers are somewhat indispensable for Unix systems. First, some User Mode applications like the NFS server or the find command rely on the major and minor numbers to identify the physical disk partition containing a given file. Second, device numbers are required even by the POSIX standard.

Thus, the devfs layer lets the kernel define major and minor numbers for each device driver, like the old-style device files. Currently, almost all device drivers associate the devfs device file with the same major and minor numbers as the corresponding old-style device file. For this reason, we mainly focus on old-style device files in the rest of this chapter.

13.2.3 VFS Handling of Device Files

Device files live in the system directory tree but are intrinsically different from regular files and directories. When a process accesses a regular file, it is accessing some data blocks in some disk partition through a filesystem; when a process accesses a device file, it is just driving a hardware device. For instance, a process might access a device file to read the room temperature from a digital thermometer connected to the computer. It is the VFS's responsibility to hide the differences between device files and regular files from application programs.

To do this, the VFS changes the default file operations of a device file when it is opened; as a result, each system call on the device file is translated to an invocation of a device-related function instead of the corresponding function of the hosting filesystem. The device-related function acts on the hardware device to perform the operation requested by the process. [4]

[4] Notice that, thanks to the name-resolving mechanism explained in Section 12.5, symbolic links to device files work just like device files.

Let's suppose that a process executes an open( ) system call on a device file (either of type block or character). The operations performed by the system call have already been described in Section 12.6.1. Essentially, the corresponding service routine resolves the pathname to the device file and sets up the corresponding inode object, dentry object, and file object.

Assuming that the device file is old-style, the inode object is initialized by reading the corresponding inode on disk through a suitable function of the filesystem (usually ext2_read_inode( ); see Chapter 17). When this function determines that the disk inode refers to a device file, it invokes init_special_inode( ), which initializes the i_rdev field of the inode object to the major and minor numbers of the device file, and sets the i_fop field of the inode object to the address of either the def_blk_fops table or the def_chr_fops table, according to the type of device file.

The service routine of the open( ) system call also invokes the dentry_open( ) function, which allocates a new file object and sets its f_op field to the address stored in i_fop, that is, to the address of def_blk_fops or def_chr_fops once again. The contents of these two tables are shown in Section 13.4.5.2 and Section 13.5 later in this chapter; thanks to them, any system call issued on a device file will activate a device driver's function rather than a function of the underlying filesystem. [5]

[5] If the device file is a virtual file in the devfs filesystem, the mechanism is slightly different: the devfs filesystem layer does not invoke init_special_inode( ); rather, any devfs device file has a custom open method (the devfs_open( ) function), which is invoked by dentry_open( ). It is the job of the devfs_open( ) function to rewrite the f_op field of the file object so as to customize the operations triggered by the system calls.


13.3 Device Drivers

A device driver is a software layer that makes a hardware device respond to a well-defined programming interface. We are already familiar with this kind of interface; it consists of the canonical set of VFS functions (open, read, lseek, ioctl, and so forth) that control a device. The actual implementation of all these functions is delegated to the device driver. Since each device has a unique I/O controller, and thus unique commands and unique state information, most I/O devices have their own drivers.

There are many types of device drivers. They mainly differ in the level of support that they offer to the User Mode applications, as well as in their buffering strategies for the data collected from the hardware devices. Since these choices greatly influence the internal structure of a device driver, we discuss them in Section 13.3.1 and Section 13.3.2.

A device driver does not consist only of the functions that implement the device file operations. Before using a device driver, two activities must have taken place: registering the device driver and initializing it. Finally, when the device driver is performing a data transfer, it must also monitor the I/O operation. We see how all this is done in Section 13.3.3, Section 13.3.4, and Section 13.3.5.

13.3.1 Levels of Kernel Support

The Linux kernel does not fully support all possible existing I/O devices. Generally speaking, in fact, there are three possible kinds of support for a hardware device:

No support at all
    The application program interacts directly with the device's I/O ports by issuing suitable in and out assembly language instructions.

Minimal support
    The kernel does not recognize the hardware device, but does recognize its I/O interface. User programs are able to treat the interface as a sequential device capable of reading and/or writing sequences of characters.

Extended support
    The kernel recognizes the hardware device and handles the I/O interface itself. In fact, there might not even be a device file for the device.

The most common example of the first approach, which does not rely on any kernel device driver, is how the X Window System traditionally handles the graphic display. This is quite efficient, although it prevents the X server from using the hardware interrupts issued by the I/O device. This approach also requires some additional effort to allow the X server to access the required I/O ports. As mentioned in Section 3.3.2, the iopl( ) and ioperm( ) system calls grant a process the privilege to access I/O ports. They can be invoked only by programs having root privileges. But such programs can be made available to users by setting the setuid flag of the executable file and making root, whose UID is 0, its owner (see Section 20.1.1).

Recent Linux versions support several widely used graphic cards. The /dev/fb device file provides an abstraction for the frame buffer of the graphic card and allows application software to access it without needing to know anything about the I/O ports of the graphics interface. Furthermore, Version 2.4 of the kernel supports the Direct Rendering Infrastructure (DRI) that allows application software to exploit the hardware of accelerated 3D graphics cards. In any case, the traditional do-it-yourself X Window System server is still widely adopted.

The minimal support approach is used to handle external hardware devices connected to a general-purpose I/O interface. The kernel takes care of the I/O interface by offering a device file (and thus a device driver); the application program handles the external hardware device by reading and writing the device file.

Minimal support is preferable to extended support because it keeps the kernel size small. However, among the general-purpose I/O interfaces commonly found on a PC, only the serial port and the parallel port can be handled with this approach. Thus, a serial mouse is directly controlled by an application program, like the X server, and a serial modem always requires a communication program, like Minicom, Seyon, or a Point-to-Point Protocol (PPP) daemon.

Minimal support has a limited range of applications because it cannot be used when the external device must interact heavily with internal kernel data structures. For example, consider a removable hard disk that is connected to a general-purpose I/O interface. An application program cannot interact with all kernel data structures and functions needed to recognize the disk and to mount its filesystem, so extended support is mandatory in this case.

In general, any hardware device directly connected to the I/O bus, such as the internal hard disk, is handled according to the extended support approach: the kernel must provide a device driver for each such device. External devices attached to the Universal Serial Bus (USB), the PCMCIA port found in many laptops, or the SCSI interface (in short, any general-purpose I/O interface except the serial and the parallel ports) also require extended support.

It is worth noting that the standard file-related system calls like open( ), read( ), and write( ) do not always give the application full control of the underlying hardware device. In fact, the lowest-common-denominator approach of the VFS does not include room for special commands that some devices need, nor does it let an application check whether the device is in a specific internal state. The ioctl( ) system call was introduced to satisfy such needs. Besides the file descriptor of the device file and a second 32-bit parameter specifying the request, the system call can accept an arbitrary number of additional parameters. For example, specific ioctl( ) requests exist to get the CD-ROM sound volume or to eject the CD-ROM media. Application programs may provide the user interface of a CD player using these kinds of ioctl( ) requests.
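As an illustration of a device-specific request, the following sketch asks a CD-ROM driver to eject the media. It assumes a Linux system whose headers provide the CDROMEJECT request code in linux/cdrom.h; the helper name eject_media is hypothetical, and the device path to use depends on the system.

```c
#include <fcntl.h>
#include <linux/cdrom.h>   /* CDROMEJECT request code */
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Ask the driver behind the given device file to eject its media.
 * Returns 0 on success and -1 on failure.
 */
int eject_media(const char *path)
{
    int fd, rc;

    /* O_NONBLOCK lets the open succeed even when no disc is in the drive. */
    fd = open(path, O_RDONLY | O_NONBLOCK);
    if (fd < 0)
        return -1;

    /* The second parameter selects the request; this one needs no argument. */
    rc = ioctl(fd, CDROMEJECT);
    close(fd);
    return rc < 0 ? -1 : 0;
}
```

The same open( ) and close( ) calls are used as for any other file; only the ioctl( ) request is specific to the device class. Issued on a device file whose driver does not implement the request, the call simply fails.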

13.3.2 Buffering Strategies of Device Drivers

Traditionally, Unix-like operating systems divide hardware devices into block and character devices. However, this classification does not tell the whole story. Some devices are capable of transferring sizeable amounts of data in a single I/O operation, while others transfer only a few characters.

For instance, a PS/2 mouse driver gets a few bytes in each read operation; they correspond to the status of the mouse button and to the position of the mouse pointer on the screen. This kind of device is the easiest to handle. Input data is first read one character at a time from the device input register and stored in a proper kernel data structure; the data is then copied at leisure into the process address space. Similarly, output data is first copied from the process address space to a proper kernel data structure and then written one character at a time into the I/O device output register. Clearly, I/O drivers for such devices do not use the DMAC because the CPU time spent to set up a DMA I/O operation is comparable to the time spent to move the data to or from the I/O ports.

On the other hand, the kernel must also be ready to deal with devices that yield a large number of bytes in each I/O operation, either sequential devices such as sound cards or network cards, or random access devices such as disks of all kinds (floppy, CD-ROM, SCSI disk, etc.).

Suppose, for instance, that you have set up the sound card of your computer so that you are able to record sounds coming from a microphone. The sound card samples the electrical signal coming from the microphone at a fixed rate, say 44.1 kHz, and produces a stream of 16-bit numbers divided into blocks of input data. The sound card driver must be able to cope with this avalanche of data in all possible situations, even when the CPU is temporarily busy running some other process.

This can be done by combining two different techniques:

● Use of the DMA processor (DMAC) to transfer blocks of data.

● Use of a circular buffer of two or more elements, each element having the size of a block of data. When an interrupt occurs signaling that a new block of data has been read, the interrupt handler advances a pointer to the elements of the circular buffer so that further data will be stored in an empty element. Conversely, whenever the driver succeeds in copying a block of data into user address space, it releases an element of the circular buffer so that it is available for saving new data from the hardware device.
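The bookkeeping of such a circular buffer can be sketched in User Mode C as follows. The number of elements, the block size, and all the ring_ names are assumptions chosen for illustration; a real driver would also have to synchronize the interrupt handler with the code that copies data to the process, which this sketch deliberately leaves out.

```c
#include <stddef.h>
#include <string.h>

#define RING_SLOTS 4      /* number of elements in the circular buffer (assumed) */
#define BLOCK_SIZE 512    /* size of one block of data in bytes (assumed) */

struct ring {
    unsigned char slot[RING_SLOTS][BLOCK_SIZE];
    unsigned int head;    /* next element to be filled by the device side */
    unsigned int tail;    /* oldest filled element, next to be copied out */
    unsigned int used;    /* number of filled elements */
};

/* Device side: the element the next transfer should fill, or NULL if none is empty. */
unsigned char *ring_next_empty(struct ring *r)
{
    if (r->used == RING_SLOTS)
        return NULL;
    return r->slot[r->head];
}

/* Interrupt handler side: a block has landed in the current element,
 * so advance the pointer to the next one. Returns -1 on overrun. */
int ring_block_arrived(struct ring *r)
{
    if (r->used == RING_SLOTS)
        return -1;
    r->head = (r->head + 1) % RING_SLOTS;
    r->used++;
    return 0;
}

/* Reader side: copy the oldest block out and release its element.
 * Returns -1 if no block is available. */
int ring_read_block(struct ring *r, unsigned char *dst)
{
    if (r->used == 0)
        return -1;
    memcpy(dst, r->slot[r->tail], BLOCK_SIZE);
    r->tail = (r->tail + 1) % RING_SLOTS;
    r->used--;
    return 0;
}
```

As long as the reader releases elements at least as fast, on average, as blocks arrive, short delays in the reader are absorbed by the elements that are still empty.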

The role of the circular buffer is to smooth out the peaks of CPU load; even if the User Mode application receiving the data is slowed down because of other higher priority tasks, the DMAC is able to continue filling elements of the circular buffer because the interrupt handler executes on behalf of the currently running process.

A similar situation occurs when receiving packets from a network card, except that in this case, the flow of incoming data is asynchronous. Packets are received independently from each other and the time interval that occurs between two consecutive packet arrivals is unpredictable.

All considered, buffer handling for sequential devices is easy because the same buffer is never reused: an audio application cannot ask the microphone to retransmit the same block of data; similarly, a networking application cannot ask the network card to retransmit the same packet.

On the other hand, buffering for random access devices (disks of any kind) is much more complicated. In this case, applications are entitled to ask repeatedly to read or write the same block of data. Furthermore, accesses to these devices are usually very slow. These peculiarities have a profound impact on the structure of the disk drivers.

Thus, buffers for random access devices play a different role. Instead of smoothing out the peaks of the CPU load, they are used to contain data that is no longer needed by any process, just in case some other process will require the same data at some later time. In other words, buffers are the basic components of a software cache (see Chapter 14) that reduces the number of disk accesses.

13.3.3 Registering a Device Driver

We know that each system call issued on a device file is translated by the kernel into an invocation of a suitable function of a corresponding device driver. To achieve this, a device driver must register itself. In other words, registering a device driver means linking it to the corresponding device files. Accesses to device files whose corresponding drivers have not been previously registered return the error code -ENODEV.

If a device driver is statically compiled in the kernel, its registration is performed during the kernel initialization phase. Conversely, if a device driver is compiled as a kernel module (see Appendix B), its registration is performed when the module is loaded. In the latter case, the device driver can also unregister itself when the module is unloaded.

Character device drivers using old-style device files [6] are described by a chrdevs array of device_struct data structures; each array index is the major number of a device file. Major numbers range between 1 (no device file can have the major number 0) and 254 (the value 255 is reserved for future extensions), thus the array contains 255 elements, but the first of them is not used. Each structure includes two fields: name points to the name of the device class and fops points to a file_operations structure (see Section 12.2.3).

[6] As you might suspect, the registration procedure is quite different for the old-style device files and devfs device files. Unfortunately, this means that if both types are in use at the same time, a device driver must register itself twice.

Similarly, block device drivers using old-style device files are described by a blkdevs array of 255 data structures (as in the chrdevs array, the first entry is not used). Each structure includes two fields: name points to the name of the device class and bdops points to a block_device_operations structure, which stores a few custom methods for crucial operations of the block device driver (see Table 13-3).

Table 13-3. The methods of block device drivers

Method              Event that triggers the invocation of the method
open                Opening the block device file
release             Closing the last reference to a block device file
ioctl               Issuing an ioctl( ) system call on the block device file
check_media_change  Checking whether the media has been changed (e.g., floppy disk)
revalidate          Checking whether the block device holds valid data

The chrdevs and blkdevs tables are initially empty. The register_chrdev( ) and register_blkdev( ) functions insert a new entry into one of the tables. If a device driver is implemented through a module, it can be unregistered when the module is unloaded by means of the unregister_chrdev( ) or unregister_blkdev( ) functions.

For example, the descriptor for the parallel printer driver class is inserted in the chrdevs table as follows:

    register_chrdev(6, "lp", &lp_fops);

The first parameter denotes the major number, the second denotes the device class name, and the last is a pointer to the table of file operations.

Notice that, once registered, a device driver is linked to the major number of a device file and not to its pathname. Thus, any access to a device file activates the corresponding driver, regardless of the pathname used.
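The table-based registration just described can be modeled in User Mode C. The sketch below is an assumption-laden illustration, not the kernel's code: all sketch_ names are hypothetical, the method table is an empty placeholder, and dynamic allocation of major numbers is left out.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_MAX_CHRDEV 255   /* indexes 0..254; entry 0 is never used */

struct sketch_file_operations { int unused; };   /* placeholder method table */

struct sketch_device_struct {
    const char *name;                       /* name of the device class */
    struct sketch_file_operations *fops;    /* driver methods, NULL if free */
};

static struct sketch_device_struct sketch_chrdevs[SKETCH_MAX_CHRDEV];

/* Insert an entry for the given major number; fails if another driver owns it. */
int sketch_register_chrdev(unsigned int major, const char *name,
                           struct sketch_file_operations *fops)
{
    if (major == 0 || major >= SKETCH_MAX_CHRDEV)
        return -EINVAL;
    if (sketch_chrdevs[major].fops && sketch_chrdevs[major].fops != fops)
        return -EBUSY;
    sketch_chrdevs[major].name = name;
    sketch_chrdevs[major].fops = fops;
    return 0;
}

/* Remove the entry, provided it is registered under the same class name. */
int sketch_unregister_chrdev(unsigned int major, const char *name)
{
    if (major == 0 || major >= SKETCH_MAX_CHRDEV)
        return -EINVAL;
    if (!sketch_chrdevs[major].fops || strcmp(sketch_chrdevs[major].name, name))
        return -EINVAL;
    sketch_chrdevs[major].name = NULL;
    sketch_chrdevs[major].fops = NULL;
    return 0;
}

/* Lookup done when a device file is opened: -ENODEV if no driver is registered. */
int sketch_lookup_chrdev(unsigned int major, struct sketch_file_operations **fops)
{
    if (major == 0 || major >= SKETCH_MAX_CHRDEV || !sketch_chrdevs[major].fops)
        return -ENODEV;
    *fops = sketch_chrdevs[major].fops;
    return 0;
}
```

Because the table is indexed by major number only, the lookup never sees the pathname of the device file, which is why the pathname plays no role in selecting the driver.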

13.3.4 Initializing a Device Driver

Registering a device driver and initializing it are two different things. A device driver is registered as soon as possible so User Mode applications can use it through the corresponding device files. In contrast, a device driver is initialized at the last possible moment. In fact, initializing a driver means allocating precious resources of the system, which are therefore not available to other drivers.

We already have seen an example in Section 4.6.1: the assignment of IRQs to devices is usually made dynamically, right before using them, since several devices may share the same IRQ line. Other resources that can be allocated at the last possible moment are page frames for DMA transfer buffers and the DMA channel itself (for old non-PCI devices like the floppy disk driver).

To make sure the resources are obtained when needed but are not requested in a redundant manner when they have already been granted, device drivers usually adopt the following schema:

● A usage counter keeps track of the number of processes that are currently accessing the device file. The counter is incremented in the open method of the device file and decremented in the release method. [7]

● The open method checks the value of the usage counter before the increment. If the counter is zero, the device driver must allocate the resources and enable interrupts and DMA on the hardware device.

● The release method checks the value of the usage counter after the decrement. If the counter is zero, no more processes are using the hardware device. If so, the method disables interrupts and DMA on the I/O controller, and then releases the allocated resources.

[7] More precisely, the usage counter keeps track of the number of file objects referring to the device file, since clone processes could share the same file object.
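The schema above can be sketched as follows. All names are hypothetical, the two hardware hooks are empty stand-ins for real resource handling (IRQ line, DMA buffers, and so on), and the locking that a real driver needs around the counter is omitted.

```c
/* Hypothetical per-device state. */
struct sketch_device {
    int usage;    /* number of file objects currently referring to the device */
    int active;   /* 1 while resources are allocated and the hardware is enabled */
    int setups;   /* how many times the resources were allocated (for illustration) */
};

/* Stand-in for allocating resources and enabling interrupts and DMA. */
static int sketch_hw_setup(struct sketch_device *dev)
{
    dev->active = 1;
    dev->setups++;
    return 0;
}

/* Stand-in for disabling interrupts and DMA and releasing the resources. */
static void sketch_hw_teardown(struct sketch_device *dev)
{
    dev->active = 0;
}

/* open method: the counter is checked before the increment. */
int sketch_open(struct sketch_device *dev)
{
    if (dev->usage == 0) {
        int err = sketch_hw_setup(dev);
        if (err)
            return err;   /* the counter stays at zero on failure */
    }
    dev->usage++;
    return 0;
}

/* release method: the counter is checked after the decrement. */
void sketch_release(struct sketch_device *dev)
{
    if (--dev->usage == 0)
        sketch_hw_teardown(dev);
}
```

Opening the device twice allocates the resources only once, and they are released only when the second of the two matching release calls brings the counter back to zero.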

13.3.5 Monitoring I/O Operations

The duration of an I/O operation is often unpredictable. It can depend on mechanical considerations (the current position of a disk head with respect to the block to be transferred), on truly random events (when a data packet arrives on the network card), or on human factors (when a user presses a key on the keyboard or notices that a paper jam occurred in the printer). In any case, the device driver that started an I/O operation must rely on a monitoring technique that signals either the termination of the I/O operation or a time-out.

In the case of a terminated operation, the device driver reads the status register of the I/O interface to determine whether the I/O operation was carried out successfully. In the case of a time-out, the driver knows that something went wrong, since the maximum time interval allowed to complete the operation elapsed and nothing happened.

The two techniques available to monitor the end of an I/O operation are called the polling mode and the interrupt mode.

13.3.5.1 Polling mode

According to this technique, the CPU checks (polls) the device's status register repeatedly until its value signals that the I/O operation has been completed. We have already encountered a technique based on polling in Section 5.3.3: when a processor tries to acquire a busy spin lock, it repeatedly polls the variable until its value becomes 0. However, polling applied to I/O operations is usually more elaborate, since the driver must also remember to check for possible time-outs. A simple example of polling looks like the following:

    for (;;) {
        if (read_status(device) & DEVICE_END_OPERATION)
            break;
        if (--count == 0)
            break;
    }

The count variable, which was initialized before entering the loop, is decremented at each iteration, and thus can be used to implement a rough time-out mechanism. Alternatively, a more precise time-out mechanism could be implemented by reading the value of the tick counter jiffies at each iteration (see Section 6.2.1.1) and comparing it with the old value read before starting the wait loop.

If the time required to complete the I/O operation is relatively high, say on the order of milliseconds, this scheme becomes inefficient because the CPU wastes precious machine cycles while waiting for the I/O completion. In such cases, it is preferable to voluntarily relinquish the CPU after each polling operation by inserting an invocation of the schedule( ) function inside the loop.
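The tick-based variant of the time-out can be tried outside the kernel with a simulated device. The sketch below is an assumption-laden illustration, not driver code: fake_device, fake_jiffies, and poll_device( ) are hypothetical names, and the simulated tick counter simply advances by one on every status read.

```c
#include <stdbool.h>

#define DEVICE_END_OPERATION 0x01u

/* Simulated device: reports completion at the ready_after-th status read. */
struct fake_device {
    unsigned reads;        /* status reads performed so far */
    unsigned ready_after;  /* read number at which the operation completes */
};

/* Simulated tick counter, standing in for the kernel's jiffies. */
static unsigned long fake_jiffies;

unsigned read_status(struct fake_device *dev)
{
    fake_jiffies++;                      /* pretend one tick passes per poll */
    dev->reads++;
    return dev->reads >= dev->ready_after ? DEVICE_END_OPERATION : 0u;
}

/* Returns true on completion, false if timeout ticks elapsed first. */
bool poll_device(struct fake_device *dev, unsigned long timeout)
{
    unsigned long start = fake_jiffies;  /* old value read before the loop */

    for (;;) {
        if (read_status(dev) & DEVICE_END_OPERATION)
            return true;
        /* Unsigned subtraction stays correct even if the counter wraps. */
        if (fake_jiffies - start >= timeout)
            return false;
        /* A real driver could yield the CPU here, e.g. by calling schedule(). */
    }
}
```

Comparing the elapsed ticks rather than a loop count makes the time-out independent of how fast the CPU executes the loop body.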

13.3.5.2 Interrupt mode

Interrupt mode can be used only if the I/O controller is capable of signaling, via an IRQ line, the end of an I/O operation.

We'll show how interrupt mode works on a simple case. Let's suppose we want to implement a driver for a simple input character device. When the user issues a read( ) system call on the corresponding device file, an input command is sent to the device's control register. After an unpredictably long time interval, the device puts a single byte of data in its input register. The device driver then returns this byte as the result of the read( ) system call.

This is a typical case in which it is preferable to implement the driver using the interrupt mode; in fact, the device driver doesn't know in advance how much time it has to wait for an answer from the hardware device. Essentially, the driver includes two functions:

1. The foo_read( ) function that implements the read method of the file object

2. The foo_interrupt( ) function that handles the interrupt

The foo_read( ) function is triggered whenever the user reads the device file:

    ssize_t foo_read(struct file *filp, char *buf, size_t count, loff_t *ppos)
    {
        foo_dev_t *foo_dev = filp->private_data;

        if (down_interruptible(&foo_dev->sem))
            return -ERESTARTSYS;
        foo_dev->intr = 0;
        outb(DEV_FOO_READ, DEV_FOO_CONTROL_PORT);
        wait_event_interruptible(foo_dev->wait, (foo_dev->intr == 1));
        if (put_user(foo_dev->data, buf)) {
            up(&foo_dev->sem);    /* release the semaphore on the error path too */
            return -EFAULT;
        }
        up(&foo_dev->sem);
        return 1;
    }

The device driver relies on a custom descriptor of type foo_dev_t; it includes a semaphore sem that protects the hardware device from concurrent accesses, a wait queue wait, a flag intr that is set when the device issues an interrupt, and a single-byte buffer data that is written by the interrupt handler and read by the read method. In general, all I/O drivers that use interrupts rely on data structures accessed by both the interrupt handler and the read and write methods. The address of the foo_dev_t descriptor is usually stored in the private_data field of the device file's file object or in a global variable.

The main operations of the foo_read( ) function are the following:

1. Acquires the foo_dev->sem semaphore, thus ensuring that no other process is accessing the device.

2. Clears the intr flag.

3. Issues the read command to the I/O device.

4. Executes wait_event_interruptible to suspend the process until the intr flag becomes 1. This macro is described in Section 3.2.4.1.

After some time, our device issues an interrupt to signal that the I/O operation is completed and that the data is ready in the proper DEV_FOO_DATA_PORT data port. The interrupt handler sets the intr flag and wakes the process. When the scheduler decides to reexecute the process, the second part of foo_read( ) is executed and does the following:

1. Copies the character ready in the foo_dev->data variable into the user address space.

2. Terminates after releasing the foo_dev->sem semaphore.

For simplicity, we didn't include any time-out control. In general, time-out control is implemented through static or dynamic timers (see Chapter 6); the timer must be set to the right time before starting the I/O operation and removed when the operation terminates.

Let's now look at the code of the foo_interrupt( ) function:

    void foo_interrupt(int irq, void *dev_id, struct pt_regs *regs)
    {
        foo->data = inb(DEV_FOO_DATA_PORT);
        foo->intr = 1;
        wake_up_interruptible(&foo->wait);
    }

The interrupt handler reads the character from the input register of the device and stores it in the data field of the foo_dev_t descriptor of the device driver pointed to by the foo global variable. It then sets the intr flag and invokes wake_up_interruptible( ) to wake the process blocked in the foo->wait wait queue.

Notice that none of the three parameters are used by our interrupt handler. This is a rather common case.


13.4 Block Device Drivers

Typical block devices like hard disks have very high average access times. Each operation requires several milliseconds to complete, mainly because the hard disk controller must move the heads on the disk surface to reach the exact position where the data is recorded. However, when the heads are correctly placed, data transfer can be sustained at rates of tens of megabytes per second.

To achieve acceptable performance, hard disks and similar devices transfer several adjacent bytes at once. In the following discussion, we say that groups of bytes are adjacent when they are recorded on the disk surface in such a manner that a single seek operation can access them.

The organization of Linux block device handlers is quite involved. We won't be able to discuss in detail all the functions that are included in the kernel to support the handlers. But we outline the general software architecture and introduce the main data structures. Kernel support for block device handlers includes the following features:

● A uniform interface through the VFS

● Efficient read-ahead of disk data

● Disk caching for the data

13.4.1 Keeping Track of Block Device Drivers

When a block device file is being opened, the kernel must determine whether the device file is already open. In fact, if the file is already open, the kernel must not initialize the corresponding block device driver.

This problem is not as easy as it appears at first look. On the one hand, we stated in Section 13.2 that block device files that have the same major number are usually associated with the same block device driver. However, each block device driver that handles more than one minor number can be considered as several specialized block device drivers, so this case doesn't create problems. In the rest of this section, when we use the term "block device driver," we mean the kernel layer that handles I/O data transfers from/to a hardware device specified by both a major number and a minor number.

A real complication, however, is that block device files that have the same major and minor numbers but different pathnames are regarded by the VFS as different files, but they really refer to the same block device driver. Therefore, the kernel cannot determine whether a block device driver is already in use by simply checking for the existence in the inode cache of an object for the block device file.

To keep track of which block device drivers are currently in use, the kernel uses a hash table indexed by the major and minor numbers. Every time a block device driver is being used, the kernel checks whether the corresponding block device driver identified by the major and minor numbers is already stored in the hash table. If so, the block device driver is already in use; notice that the hash function works on the major and minor numbers of the block device file, thus it doesn't matter whether the block device driver was previously activated by accessing a given block device file, or another one that has the same major and minor numbers. Conversely, if a block device driver associated with the given major and minor numbers is not found, the kernel inserts a new element into the hash table.
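The lookup-or-insert behavior just described can be sketched in ordinary C. Everything in this block is an assumption made for illustration: the bdev_desc structure, the bdev_get( ) function, and the hash function are hypothetical and much simpler than the kernel's own code. The table uses 64 lists, the figure the text gives for bdev_hashtable.

```c
#include <stddef.h>
#include <stdlib.h>

#define HASH_SIZE 64   /* number of lists; must be a power of 2 here */

/* Simplified, hypothetical descriptor; the kernel's has more fields. */
struct bdev_desc {
    unsigned major, minor;
    int count;                 /* usage counter */
    struct bdev_desc *next;    /* next descriptor in the same hash list */
};

static struct bdev_desc *table[HASH_SIZE];

/* Illustrative hash on the device numbers, not the kernel's function. */
static unsigned hash(unsigned major, unsigned minor)
{
    return (major * 31u + minor) & (HASH_SIZE - 1);
}

/* Returns the descriptor for (major, minor), creating it on first use. */
struct bdev_desc *bdev_get(unsigned major, unsigned minor)
{
    unsigned h = hash(major, minor);
    struct bdev_desc *d;

    for (d = table[h]; d != NULL; d = d->next)
        if (d->major == major && d->minor == minor) {
            d->count++;        /* found: the driver is already in use */
            return d;
        }

    d = calloc(1, sizeof(*d));
    if (d == NULL)
        return NULL;
    d->major = major;
    d->minor = minor;
    d->count = 1;
    d->next = table[h];        /* not found: insert a new element */
    table[h] = d;
    return d;
}
```

Because the key is the pair of device numbers rather than a pathname, two device files with different names but the same numbers resolve to the same descriptor.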

The hash table array is stored in the bdev_hashtable variable; it includes 64 lists of block device descriptors. Each descriptor is a block_device data structure whose fields are shown in Table 13-4.

Table 13-4. The fields of the block device descriptor

Type                              Field       Description
struct list_head                  bd_hash     Pointers for the hash table list
atomic_t                          bd_count    Usage counter for the block device descriptor
struct inode *                    bd_inode    Pointer to the main inode object of the block device driver
dev_t                             bd_dev      Major and minor numbers of the block device
int                               bd_openers  Counter of how many times the block device driver has been opened
struct block_device_operations *  bd_op       Pointer to the block device driver operation table
struct semaphore                  bd_sem      Semaphore protecting the block device driver
struct list_head                  bd_inodes   List of inodes of opened block device files for this driver

The bd_inodes field of the block device descriptor stores the head (the first dummy element) of a doubly linked circular list of inodes relative to opened block device files that refer to the block device driver. The i_devices field of the inode object stores the pointers for the previous and next element in this list.

Each block device descriptor stores in the bd_inode field the address of a special block device inode object for the driver. This inode doesn't correspond to a disk file; rather, it belongs to the bdev special filesystem (see Section 12.3.1). Essentially, the block device inode stores the "master copy" of the information shared by all inode objects of the block device files that refer to the same block device.

13.4.2 Initializing a Block Device Driver

Let's now describe how a block device driver is initialized. We already described how the kernel customizes the methods of the file object when a block device file is opened in Section 13.2.3. Its f_op field is set to the address of the def_blk_fops variable. The contents of this table are shown in Table 13-5. The dentry_open( ) function checks whether the open method is defined; this is always true for a block device file, so the blkdev_open( ) function is executed.

Table 13-5. The default file operation methods for block device files

Method   Function for block device file
open     blkdev_open( )
release  blkdev_close( )
llseek   block_llseek( )
read     generic_file_read( )
write    generic_file_write( )
mmap     generic_file_mmap( )
fsync    block_fsync( )
ioctl    blkdev_ioctl( )

The blkdev_open( ) function checks whether the block device driver is already in use:

    bd_acquire(inode);
    do_open(inode->i_bdev, filp);

The bd_acquire( ) function essentially executes the following operations:

1. Checks whether the block device file corresponding to the inode object is already open (in this case, the inode->i_bdev field points to the block device descriptor). If the file is already open, increments the usage counter of the block device descriptor (inode->i_bdev->bd_count) and returns.

2. Looks up the block device driver in the hash table using the major and minor numbers stored in inode->i_rdev. If the descriptor is not found because the driver is not in use, allocates a new block_device and a new inode object for the block device, and then inserts the new descriptor in the hash table.

3. Stores the address of the block device driver descriptor in inode->i_bdev.

4. Adds inode to the list of inodes of the driver descriptor.

Next, blkdev_open( ) invokes do_open( ), which executes the following main steps:

1. If the bd_op field of the block device driver descriptor is NULL, initializes it from the blkdevs table's element corresponding to the major number of the block device file

2. Invokes the open method of the block device driver descriptor (bd_op->open) if it is defined

3. Increments the bd_openers counter of the block device driver descriptor

4. Sets the i_size and i_blkbits fields of the block device inode object (bd_inode)
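The first three steps can be sketched in user-space C as follows. This is a rough model under stated assumptions, not the kernel's do_open( ): the bdev and bdev_ops structures, the drivers table (standing in for blkdevs), and the demo driver are hypothetical, step 4 is omitted, and counting the opener only when open succeeds is an assumption of the sketch.

```c
#include <stddef.h>

#define MAX_MAJOR 255

struct bdev;                     /* forward declaration */

/* Hypothetical operation table holding only an open method. */
struct bdev_ops {
    int (*open)(struct bdev *bdev);
};

/* Simplified descriptor, loosely modeled on bd_op and bd_openers. */
struct bdev {
    unsigned major, minor;
    const struct bdev_ops *op;   /* plays the role of bd_op */
    int openers;                 /* plays the role of bd_openers */
};

/* Per-major table of registered drivers, standing in for blkdevs. */
static const struct bdev_ops *drivers[MAX_MAJOR + 1];

/* A do-nothing demo driver so the sketch can be exercised. */
static int demo_open(struct bdev *bdev) { (void)bdev; return 0; }
static const struct bdev_ops demo_ops = { demo_open };

int do_open_sketch(struct bdev *bdev)
{
    int err = 0;

    if (bdev->major > MAX_MAJOR)
        return -1;
    /* Step 1: bind the operation table by major number on first open. */
    if (bdev->op == NULL)
        bdev->op = drivers[bdev->major];
    if (bdev->op == NULL)
        return -1;               /* no driver registered for this major */

    /* Step 2: invoke the driver's open method, if it is defined. */
    if (bdev->op->open != NULL)
        err = bdev->op->open(bdev);

    /* Step 3: count the opener (here, only if open succeeded). */
    if (err == 0)
        bdev->openers++;
    return err;
}
```

Binding the operation table lazily means a descriptor created by the hash table lookup needs no driver-specific knowledge until the device is actually opened.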

The open method of the block device driver descriptor can further customize the methods of the block device driver, allocate resources, and take other measures based on the minor number of the block device file.

Among other things, the device driver initialization function must determine the size of the physical block device corresponding to the device file. This length, represented in 1,024-byte units, is stored in the blk_size global array indexed by both the major and minor number of the device file.

13.4.3 Sectors, Blocks, and Buffers

Each data transfer operation for a block device acts on a group of adjacent bytes called a sector. In most disk devices, the size of a sector is 512 bytes, although there are devices that use larger sectors (1,024 and 2,048 bytes). Notice that the sector should be considered the basic unit of data transfer; it is never possible to transfer less than a sector, although most disk devices are capable of transferring several adjacent sectors at once.

The kernel stores the sector size of each hardware block device in a table named hardsect_size. Each element in the table is indexed by the major number and the minor number of the corresponding block device file. Thus, hardsect_size[3][2] represents the sector size of /dev/hda2, which is the second primary partition of the first IDE disk (see Table 13-2). If hardsect_size[maj] is NULL, all block devices sharing the major number maj have a standard sector size of 512 bytes.

Block device drivers transfer a large number of adjacent bytes called a block in a single operation. A block should not be confused with a sector. The sector is the basic unit of data transfer for the hardware device, while the block is simply a group of adjacent bytes involved in an I/O operation requested by a device driver.

In Linux, the block size must be a power of 2 and cannot be larger than a page frame. Moreover, it must be a multiple of the sector size, since each block must include an integral number of sectors. Therefore, on PC architecture, the permitted block sizes are 512, 1,024, 2,048, and 4,096 bytes. The same block device driver may operate with several block sizes, since it has to handle a set of device files sharing the same major number, while each block device file has its own predefined block size. For instance, a block device driver could handle a hard disk with two partitions containing an Ext2 filesystem and a swap area (see Chapter 16 and Chapter 17). In this case, the device driver uses two different block sizes: 1,024 bytes for the Ext2 partition and 4,096 bytes for the swap partition.

The kernel stores the block size in a table named blksize_size; each element in the table is indexed by the major number and the minor number of the corresponding block device file. If blksize_size[maj] is NULL, all block devices sharing the major number maj have a standard block size of 1,024 bytes. (You should not confuse blk_size with the blksize_size array, which stores the block size of the block devices rather than the size of the block devices themselves.)

Each block requires its own buffer, which is a RAM memory area used by the kernel to store the block's content. When a device driver reads a block from disk, it fills the corresponding buffer with the values obtained from the hardware device; similarly, when a device driver writes a block on disk, it updates the corresponding group of adjacent bytes on the hardware device with the actual values of the associated buffer. The size of a buffer always matches the size of the corresponding block.
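The constraints on block sizes and the NULL-means-default convention of the per-major tables can be checked with a small C sketch. The helper names (valid_block_size( ), sector_size( )) and the PAGE_SIZE_BYTES constant are assumptions introduced for this example; the 4,096-byte page frame is the PC value implied by the list of permitted sizes.

```c
#include <stdbool.h>
#include <stddef.h>

#define PAGE_SIZE_BYTES     4096u  /* page frame size assumed for the PC */
#define DEFAULT_SECTOR_SIZE 512    /* used when the per-major entry is NULL */

/* True if size is a power of 2. */
bool is_power_of_two(unsigned size)
{
    return size != 0 && (size & (size - 1)) == 0;
}

/* A block must be a power of 2, fit in a page frame, and hold whole sectors. */
bool valid_block_size(unsigned block, unsigned sector)
{
    return is_power_of_two(block)
        && block <= PAGE_SIZE_BYTES
        && sector != 0
        && block % sector == 0;
}

/* Two-level lookup in the style of hardsect_size[major][minor]. */
int sector_size(int **table, unsigned major, unsigned minor)
{
    if (table[major] == NULL)
        return DEFAULT_SECTOR_SIZE;
    return table[major][minor];
}
```

With a 512-byte sector, the only sizes accepted by valid_block_size( ) are 512, 1,024, 2,048, and 4,096, which matches the list in the text.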

13.4.4 Buffer Heads

The buffer head is a descriptor of type buffer_head associated with each buffer. It contains all the information needed by the kernel to know how to handle the buffer; thus, before operating on each buffer, the kernel checks its buffer head.

The buffer head fields are listed in Table 13-6. The b_data field of each buffer head stores the starting address of the corresponding buffer. Since a page frame may store several buffers, the b_this_page field points to the buffer head of the next buffer in the page. This field facilitates the storage and retrieval of entire page frames (see Section 13.4.8.2 later in this chapter). The b_blocknr field stores the logical block number (i.e., the index of the block inside the disk partition).

Table 13-6. The fields of a buffer head

Type                   Field            Description
struct buffer_head *   b_next           Next item in collision hash list
unsigned long          b_blocknr        Logical block number
unsigned short         b_size           Block size
unsigned short         b_list           LRU list including the buffer head
kdev_t                 b_dev            Virtual device identifier
atomic_t               b_count          Block usage counter
kdev_t                 b_rdev           Real device identifier
unsigned long          b_state          Buffer status flags
unsigned long          b_flushtime      Flushing time for buffer
struct buffer_head *   b_next_free      Next item in list of buffer heads
struct buffer_head *   b_prev_free      Previous item in list of buffer heads
struct buffer_head *   b_this_page      Per-page buffer list
struct buffer_head *   b_reqnext        Next item in the request queue
struct buffer_head **  b_pprev          Previous item in collision hash list
char *                 b_data           Pointer to buffer
struct page *          b_page           Pointer to the descriptor of the page that stores the buffer
void (*)( )            b_end_io         I/O completion method
void *                 b_private        Specialized device driver data
unsigned long          b_rsector        Block number on real device
wait_queue_head_t      b_wait           Buffer wait queue
struct inode *         b_inode          Pointer to inode object to which the buffer belongs
struct list_head       b_inode_buffers  Pointers for list of inode buffers

The b_state field stores the following flags:

BH_Uptodate
    Set if the buffer contains valid data. The value of this flag is returned by the buffer_uptodate( ) macro.

BH_Dirty
    Set if the buffer is dirty, that is, if it contains data that must be written to the block device. The value of this flag is returned by the buffer_dirty( ) macro.

BH_Lock
    Set if the buffer is locked, which happens if the buffer is involved in a disk transfer. The value of this flag is returned by the buffer_locked( ) macro.

BH_Req
    Set if the corresponding block is requested (see the next section) and has valid (up-to-date) data. The value of this flag is returned by the buffer_req( ) macro.

BH_Mapped
    Set if the buffer is mapped to disk, that is, if the b_dev and b_blocknr fields of the corresponding buffer head are significant. The value of this flag is returned by the buffer_mapped( ) macro.

BH_New
    Set if the corresponding file block has just been allocated and has never been accessed. The value of this flag is returned by the buffer_new( ) macro.

BH_Async
    Set if the buffer is being processed by end_buffer_io_async( ) (described in the later section Section 13.4.8.2). The value of this flag is returned by the buffer_async( ) macro.

BH_Wait_IO
    Used to delay flushing the buffer to disk when reclaiming memory (see Chapter 16).

BH_launder
    Set when the buffer is being flushed to disk when reclaiming memory (see Chapter 16).

BH_JBD
    Set if the buffer is used by a journaling filesystem (see Chapter 17).
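Status flags of this kind are typically kept as individual bits of one word. The sketch below shows the idea in portable C; it is an assumption-based illustration, not the kernel's definitions. The flag names come from the list above, but their numeric positions, the buffer_head_sketch structure, and the bh_set( ), bh_clear( ), and bh_test( ) helpers are hypothetical.

```c
#include <stdbool.h>

/* Bit positions; names follow the text, numeric values are assumed. */
enum bh_state_bits {
    BH_Uptodate, BH_Dirty, BH_Lock, BH_Req, BH_Mapped, BH_New, BH_Async
};

/* Reduced stand-in for a buffer head: only the status word is kept. */
struct buffer_head_sketch {
    unsigned long b_state;   /* one bit per flag */
};

void bh_set(struct buffer_head_sketch *bh, enum bh_state_bits bit)
{
    bh->b_state |= 1UL << bit;
}

void bh_clear(struct buffer_head_sketch *bh, enum bh_state_bits bit)
{
    bh->b_state &= ~(1UL << bit);
}

/* Plays the role of macros such as buffer_dirty( ) or buffer_locked( ). */
bool bh_test(const struct buffer_head_sketch *bh, enum bh_state_bits bit)
{
    return ((bh->b_state >> bit) & 1UL) != 0;
}
```

Packing the flags into one word lets several conditions be stored in a single field and tested independently of one another.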

The b_dev field identifies the virtual device containing the block stored in the buffer, while the b_rdev field identifies the real device. This distinction, which is meaningless for simple hard disks, has been introduced to model Redundant Array of Independent Disks (RAID) storage units consisting of several disks operating in parallel. For reasons of safety and efficiency, files stored in a RAID array are scattered across several disks that the applications think of as a single logical disk. Besides the b_blocknr field representing the logical block number, it is necessary to specify the specific disk unit in the b_rdev field and the corresponding sector number in the b_rsector field.

13.4.5 An Overview of Block Device Driver Architecture

Although block device drivers are able to transfer a single block at a time, the kernel does not perform an individual I/O operation for each block to be accessed on disk; this would lead to poor disk performance, since locating the physical position of a block on the disk surface is quite time-consuming. Instead, the kernel tries, whenever possible, to cluster several blocks and handle them as a whole, thus reducing the average number of head movements.

When a process, the VFS layer, or any other kernel component wishes to read or write a disk block, it actually creates a block device request. That request essentially describes the requested block and the kind of operation to be performed on it (read or write). However, the kernel does not satisfy a request as soon as it is created; the I/O operation is just scheduled and will be performed at a later time. This artificial delay is paradoxically the crucial mechanism for boosting the performance of block devices. When a new block data transfer is requested, the kernel checks whether it can be satisfied by slightly enlarging a previous request that is still waiting (i.e., whether the new request can be satisfied without further seek operations). Since disks tend to be accessed sequentially, this simple mechanism is very effective.

Deferring requests complicates block device handling. For instance, suppose a process opens a regular file and, consequently, a filesystem driver wants to read the corresponding inode from disk. The block device driver puts the request on a queue and the process is suspended until the block storing the inode is transferred. However, the block device driver itself cannot be blocked because any other process trying to access the same disk would be blocked as well.

To keep the block device driver from being suspended, each I/O operation is processed asynchronously. Thus, no kernel control path is forced to wait until a data transfer completes. In particular, block device drivers are interrupt-driven (see Section 13.3.5.2 earlier in this chapter): a high-level driver creates a new block device request or enlarges an already existing block device request and then terminates. A low-level driver, which is activated at a later time, invokes a so-called strategy routine, which takes the request from a queue and satisfies it by issuing suitable commands to the disk controller. When the I/O operation terminates, the disk controller raises an interrupt and the corresponding handler invokes the strategy routine again, if necessary, to process another request in the queue.

Each block device driver maintains its own request queues; there should be one request queue for each physical block device, so that the requests can be ordered in such a way as to increase disk performance. The strategy routine can thus sequentially scan the queue and service all requests with the minimum number of head movements.
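The idea of satisfying a new transfer by enlarging a pending request can be sketched as follows. This is a simplified illustration under stated assumptions, not the kernel's merging code: the req_sketch structure and the try_merge( ) function are hypothetical, and a request is reduced to a direction plus a run of adjacent sectors.

```c
#include <stdbool.h>

/* Hypothetical pending request: a direction and a run of adjacent sectors. */
struct req_sketch {
    int cmd;                    /* 0 = read, 1 = write */
    unsigned long sector;       /* first sector of the run */
    unsigned long nr_sectors;   /* length of the run, in sectors */
};

/*
 * Tries to satisfy a new transfer of count sectors starting at sector
 * by enlarging the pending request rq. Returns true if it was merged.
 */
bool try_merge(struct req_sketch *rq, int cmd,
               unsigned long sector, unsigned long count)
{
    if (cmd != rq->cmd)
        return false;                     /* reads and writes never mix */

    if (sector == rq->sector + rq->nr_sectors) {
        rq->nr_sectors += count;          /* new data follows the request */
        return true;
    }
    if (sector + count == rq->sector) {
        rq->sector = sector;              /* new data precedes the request */
        rq->nr_sectors += count;
        return true;
    }
    return false;                         /* not adjacent: needs its own request */
}
```

A transfer that fails to merge would be queued as a separate request; when accesses are sequential, most transfers hit the first branch and no extra seek is needed.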

13.4.5.1 Request descriptors

Each block device request is represented by a request descriptor, which is stored in the request data structure illustrated in Table 13-7. The direction of the data transfer is stored in the cmd field; it is either READ (from block device to RAM) or WRITE (from RAM to block device). The rq_status field is used to specify the status of the request; for most block devices, it is simply set either to RQ_INACTIVE (for a request descriptor not in use) or to RQ_ACTIVE (for a valid request that is to be serviced or is already being serviced by the low-level driver).

Table 13-7. The fields of a request descriptor

Type                    Field               Description
struct list_head        queue               Pointers for request queue list
int                     elevator_sequence   The "age" of the request for the elevator algorithm
volatile int            rq_status           Request status
kdev_t                  rq_dev              Device identifier
int                     cmd                 Requested operation
int                     errors              Success or failure code
unsigned long           sector              First sector number on the (virtual) block device
unsigned long           nr_sectors          Number of sectors of the request on the (virtual) block device
unsigned long           hard_sector         First sector number of the (real) block device
unsigned long           hard_nr_sectors     Number of sectors of the request on the (real) block device
unsigned int            nr_segments         Number of segments in the request on the (virtual) block device
unsigned int            nr_hw_segments      Number of segments in the request on the (real) block device
unsigned long           current_nr_sectors  Number of sectors in the block currently transferred
void *                  special             Used only by drivers of SCSI devices
char *                  buffer              Memory area for I/O transfer
struct completion *     waiting             Wait queue associated with request
struct buffer_head *    bh                  First buffer descriptor of the request
struct buffer_head *    bhtail              Last buffer descriptor of the request
request_queue_t *       q                   Pointer to request queue descriptor

The request may encompass many adjacent blocks on the same device. The rq_dev field identifies the block device, while the sector field specifies the number of the first sector of the first block in the request. The nr_sectors field specifies the number of sectors in the request yet to be transferred. The current_nr_sectors field stores the number of sectors in the first block of the request. As we'll see later in Section 13.4.7, the sector, nr_sectors, and current_nr_sectors fields could be dynamically updated while the request is being serviced.

The nr_segments field stores the number of segments in the request. Although all blocks in the request must be adjacent on the block device, their corresponding buffers are not necessarily contiguous in RAM. A segment is a sequence of adjacent blocks in the request whose corresponding buffers are also contiguous in memory. Of course, a low-level device driver could program the DMA controller so as to transfer all blocks in the same segment in a single operation.

The hard_sector, hard_nr_sectors, and nr_hw_segments fields usually have the same value as the sector, nr_sectors, and nr_segments fields, respectively. They differ, however, when the request refers to a driver that handles several physical block devices at once. A typical example of such a driver is the Logical Volume Manager (LVM), which is able to handle several disks and disk partitions as a single virtual disk partition. In this case, the two series of fields differ because the former refers to the real physical block device, while the latter refers to the virtual device. Another example is software RAID, a driver that duplicates data on several disks to enhance reliability.

All buffer heads of the blocks in the request are collected in a singly linked list. The b_reqnext field of each buffer head points to the next element in the list, while the bh and bhtail fields of the request descriptor point, respectively, to the first element and the last element in the list.

The buffer field of the request descriptor points to the memory area used for the actual data transfer. If the request involves a single block, buffer is just a copy of the b_data field of the buffer head. However, if the request encompasses several blocks whose buffers are not consecutive in memory, the buffers are linked through the b_reqnext fields of their buffer heads as shown in Figure 13-3. On a read, the low-level driver could choose to allocate a large memory area referred to by buffer, read all sectors of the request at once, and then copy the data into the various buffers. Similarly, for a write, the low-level device driver could copy the data from many nonconsecutive buffers into a single memory area referred to by buffer and then perform the whole data transfer at once.

Figure 13-3. A request descriptor and its buffers and sectors
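The notion of a segment can be made concrete with a small user-space sketch. It is not kernel code: the buffer_head structure below keeps only three of the real fields, the count_segments( ) helper is a hypothetical name, and hardware constraints that a real driver must also honor are ignored. Under those assumptions, a new segment starts whenever a buffer does not begin exactly where the previous one ends.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's buffer head: only the fields
 * needed to reason about segments are kept. */
struct buffer_head {
    char *b_data;                    /* start of the buffer in RAM */
    unsigned int b_size;             /* block size in bytes */
    struct buffer_head *b_reqnext;   /* next buffer head in the request */
};

/* Count the segments of a request (hypothetical helper): a new segment
 * starts whenever a buffer does not begin where the previous one ends. */
static unsigned int count_segments(const struct buffer_head *bh)
{
    unsigned int segments = 0;
    const char *expected = NULL;     /* end address of the previous buffer */

    for (; bh != NULL; bh = bh->b_reqnext) {
        assert(bh->b_size > 0);
        if (segments == 0 || bh->b_data != expected)
            segments++;
        expected = bh->b_data + bh->b_size;
    }
    return segments;
}
```

For a request of three blocks in which two buffers are consecutive in RAM and the third is isolated, the helper reports two segments.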

Figure 13-3 illustrates a request descriptor encompassing three blocks. The buffers of two of them are consecutive in RAM, while the third buffer is by itself. The corresponding buffer heads identify the logical blocks on the block device; the blocks must necessarily be adjacent. Each logical block includes two sectors. The sector field of the request descriptor points to the first sector of the first block on disk, and the b_reqnext field of each buffer head points to the next buffer head.

During the initialization phase, each block device driver usually allocates a fixed number of request descriptors to handle its forthcoming I/O requests. The blk_init_queue( ) function sets up two equally sized lists of free request descriptors: one for READ operations and another for WRITE operations. The size of these lists is set to 64 if the RAM size is greater than 32 MB, or to 32 if the RAM size is less than or equal to 32 MB. The status of all request descriptors is set initially to RQ_INACTIVE.

The fixed number of request descriptors may become, under very heavy loads and high disk activity, a bottleneck. A dearth of free descriptors may force processes to wait until an ongoing data transfer terminates. Thus, a wait queue is used to queue processes waiting for a free request element. The get_request_wait( ) function tries to get a free request descriptor and puts the current process to sleep in the wait queue if none is found; the get_request( ) function is similar but simply returns NULL if no free request descriptor is available.

A threshold value known as batch_requests (set to 32 or to 16, depending on the RAM size) is used to cut down kernel overhead; when releasing a request descriptor, processes waiting for free request descriptors are not woken up unless there are at least batch_requests free descriptors. Conversely, when looking for a free request descriptor, get_request_wait( ) relinquishes the CPU if there are fewer than batch_requests free descriptors.
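The interplay between the free lists and the batch_requests threshold can be modeled in a few lines of user-space C. This is only a sketch: free_list_model and the model_*( ) functions are hypothetical names, a single counter replaces the real list of descriptors, and pairing the larger threshold (32) with the larger list (64) is an assumption, since the text only says that both values depend on the RAM size.

```c
#include <assert.h>
#include <stdbool.h>

#define MB(x) ((unsigned long)(x) << 20)

/* Toy model of one free list of request descriptors (READ or WRITE). */
struct free_list_model {
    unsigned int free;            /* descriptors currently free */
    unsigned int size;            /* total descriptors in the list */
    unsigned int batch_requests;  /* wake-up threshold */
};

/* Size the list as described in the text: 64 entries above 32 MB of RAM,
 * 32 otherwise. The matching thresholds (32 and 16) are an assumption. */
static void model_init(struct free_list_model *fl, unsigned long ram_bytes)
{
    bool big = ram_bytes > MB(32);

    fl->size = big ? 64 : 32;
    fl->batch_requests = big ? 32 : 16;
    fl->free = fl->size;
}

/* Like get_request( ): take a descriptor, or fail without sleeping
 * (the kernel function returns NULL in that case). */
static bool model_get(struct free_list_model *fl)
{
    if (fl->free == 0)
        return false;
    fl->free--;
    return true;
}

/* Release a descriptor; return true only if sleeping processes should
 * be woken up, i.e., at least batch_requests descriptors are free. */
static bool model_release(struct free_list_model *fl)
{
    assert(fl->free < fl->size);
    fl->free++;
    return fl->free >= fl->batch_requests;
}
```

With a 64-entry list that has been completely drained, the first 31 releases wake nobody; the 32nd release is the first one that does. Batching the wake-ups this way avoids waking a process for every single descriptor that is freed.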

13.4.5.2 Request queue descriptors

Request queues are represented by means of request queue descriptors; each of them is a request_queue_t data structure whose fields are listed in Table 13-8.

Table 13-8. The fields of a request queue descriptor

Type                    Field               Description
struct request_list []  rq                  READ and WRITE free lists of requests
struct list_head        queue_head          List of pending requests
elevator_t              elevator            Methods of the elevator algorithm
request_fn_proc *       request_fn          Strategy routine of the driver
merge_request_fn *      back_merge_fn       Method to append a block to the request
merge_request_fn *      front_merge_fn      Method to insert a block in front of the request
merge_requests_fn *     merge_requests_fn   Method to attempt merging an enlarged request with the adjacent ones
make_request_fn *       make_request_fn     Method that passes a request to a driver (usually it inserts the request in the proper queue)
plug_device_fn *        plug_device_fn      Method to plug the driver
void *                  queuedata           Private data of the device driver
struct tq_struct        plug_tq             Task queue element for the plugging mechanism
char                    plugged             Flag denoting whether the driver is currently plugged
char                    head_active         Flag denoting whether the first request in queue is active when the driver is unplugged
spinlock_t              queue_lock          Request queue lock
wait_queue_head_t       wait_for_request    Wait queue for lack of request descriptors

When the kernel initializes a device driver, it creates and fills a request queue descriptor for each request queue handled by the driver.

Essentially, a request queue is a doubly linked list whose elements are request descriptors (that is, request data structures). The queue_head field of the request queue descriptor stores the head (the first dummy element) of the list, while the pointers in the queue field of the request descriptor link any request to the previous and next elements in the list. The ordering of the elements in the queue list is specific to each block device driver; the Linux kernel offers, however, two predefined ways of ordering elements, which are discussed in Section 13.4.6.2 later in this chapter.
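The list layout just described (a dummy head in the queue descriptor, and a pair of pointers embedded in each request) can be sketched in user-space C as follows. The code re-creates a minimal circular doubly linked list rather than including kernel headers; request_of( ) and first_request( ) are hypothetical helper names, and both structures are reduced to the fields involved in queueing.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal re-creation of a circular doubly linked list element. */
struct list_head {
    struct list_head *next, *prev;
};

static void init_list_head(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Insert entry just before head, that is, at the tail of the list. */
static void list_add_tail(struct list_head *entry, struct list_head *head)
{
    assert(entry != head);
    entry->prev = head->prev;
    entry->next = head;
    head->prev->next = entry;
    head->prev = entry;
}

/* Only the fields relevant to queueing are modeled here. */
struct request {
    struct list_head queue;      /* links to previous and next request */
    unsigned long sector;        /* first sector of the request */
};

struct request_queue {
    struct list_head queue_head; /* dummy first element of the list */
};

/* Recover the request that embeds a given list element. */
static struct request *request_of(struct list_head *entry)
{
    return (struct request *)((char *)entry - offsetof(struct request, queue));
}

/* Return the first pending request, or NULL if the queue is empty. */
static struct request *first_request(struct request_queue *q)
{
    if (q->queue_head.next == &q->queue_head)
        return NULL;
    return request_of(q->queue_head.next);
}
```

Because the head is a dummy element, an empty queue is simply a head that points to itself, and insertion needs no special case for the first request.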

13.4.5.3 Block device low-level driver descriptor

Each block device driver may define one or more request queues. To keep track of the request queues of each driver, a low-level driver descriptor is used. The descriptor is a data structure of type blk_dev_struct, whose fields are listed in Table 13-9. The descriptors for all the block devices are stored in the blk_dev table, which is indexed by the major number of the block device.

Table 13-9. The fields of a block device driver descriptor

Type                    Field               Description
request_queue_t         request_queue       Common request queue (for drivers that do not define per-device queues)
queue_proc *            queue               Method returning the address of a per-device queue
void *                  data                Data (e.g., minor number) used by queue

If the block device driver has a unique request queue for all physical block devices, the queue is stored in the request_queue field. Conversely, if the block device driver maintains several queues, the queue field points to a custom driver method that receives the identifier of the block device file, selects one of the queues according to the value of the data field, then returns the address of the proper request queue.
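The choice between the common queue and a per-device queue can be sketched as follows. This is a simplified user-space model: kdev_t is reduced to a 16-bit value with the major number in the high byte, request_queue_t is a placeholder, and lookup_queue( ), example_queues, and example_find_queue( ) are hypothetical names. The hypothetical driver selects its queue from the minor number directly instead of going through the data field.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_BLKDEV 255
#define MAJOR(dev) ((unsigned int)((dev) >> 8))
#define MINOR(dev) ((unsigned int)((dev) & 0xff))

typedef unsigned short kdev_t;                 /* simplified device identifier */
typedef struct request_queue { int id; } request_queue_t;  /* placeholder */
typedef request_queue_t *(queue_proc)(kdev_t dev);

struct blk_dev_struct {
    request_queue_t request_queue;  /* common queue */
    queue_proc *queue;              /* optional per-device lookup method */
    void *data;                     /* private data used by queue */
};

static struct blk_dev_struct blk_dev[MAX_BLKDEV];

/* Return the request queue that serves dev (hypothetical helper). */
static request_queue_t *lookup_queue(kdev_t dev)
{
    struct blk_dev_struct *bdev;

    assert(MAJOR(dev) < MAX_BLKDEV);
    bdev = &blk_dev[MAJOR(dev)];
    if (bdev->queue != NULL)
        return bdev->queue(dev);     /* driver keeps several queues */
    return &bdev->request_queue;     /* driver uses a single queue */
}

/* Hypothetical driver with two units, one queue per unit. */
static request_queue_t example_queues[2];

static request_queue_t *example_find_queue(kdev_t dev)
{
    return &example_queues[MINOR(dev) % 2];
}
```

A driver that leaves the queue method unset gets the common queue embedded in its descriptor; a driver that installs a method such as example_find_queue( ) decides per device.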

13.4.6 The ll_rw_block( ) Function

The ll_rw_block( ) function creates a block device request. It is invoked from several places in the kernel to trigger the I/O data transfer of one or more blocks. The function receives the following parameters:

● The type of operation, rw, whose value can be READ, WRITE, or READA. The last operation type differs from READ in that the function does not block when a request descriptor is not available.

● The number, nr, of blocks to be transferred.

● A bhs array of nr pointers to buffer heads describing the blocks (all of them must have the same block size and must refer to the same block device).

The buffer heads were previously initialized, so each specifies the block number, the block size, and the virtual device identifier (see Section 13.4.4 earlier in this chapter). The function performs the following actions:

1. Checks that the block size b_size matches the block size of the virtual device b_dev for each buffer head in the bhs array.

2. If the operation is WRITE, checks that the block device is not read-only.

3. For each buffer head in the bhs array, performs the following steps:

   a. Sets the BH_Lock flag of the buffer head. If it is already set by some other kernel thread, it skips that buffer.

   b. Increments the b_count field of the buffer head.

   c. Sets the b_end_io field of the buffer head to end_buffer_io_sync( ); that is, to the function that updates the buffer head when the data transfer is completed (see Section 13.4.7 later in this chapter).

   d. If the block must be written, tests the BH_Dirty flag of the buffer head in one of the following ways:

      ● If BH_Dirty is reset, executes the b_end_io method (the end_buffer_io_sync( ) function) and continues with the next buffer because there is no need to write this block.

      ● If BH_Dirty is set, resets it and places the buffer head in the list of locked buffer heads.

      As a general rule, the caller of ll_rw_block( ) must set the BH_Dirty flag for each block that is going to be written. Therefore, if ll_rw_block( ) finds that the flag is clear, then the block is already involved in a write operation, so nothing has to be done.

   e. If the block must be read, tests the BH_Uptodate flag of the buffer head. If it is set, executes the b_end_io method (the end_buffer_io_sync( ) function) and continues with the next buffer. The kernel never rereads a block from disk when its buffer contains valid (up-to-date) data.

   f. Invokes the submit_bh( ) function, which:

      1. Determines the number of the first block's sector on the disk (that is, the value of the b_rsector field) from the fields b_blocknr (the logical block number) and b_size (the block size). This field could be later modified by the block device driver if it handles the Logical Volume Manager (LVM) or a RAID disk.

      2. Sets the BH_Req flag in b_state to denote that the block has been requested.

      3. Initializes the b_rdev field from the b_dev field. As before, this field could be modified later by the block device driver if it handles the LVM or a RAID disk.

      4. Invokes generic_make_request( ).
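The per-buffer decisions of steps 3a, 3d, and 3e, together with the sector computation performed by submit_bh( ), can be condensed into the following user-space sketch. It rests on several assumptions: the flags of b_state are modeled as plain booleans, prepare_buffer( ) and set_first_sector( ) are hypothetical names, the reference counting of step 3b and the b_end_io call itself are left out, and sectors are taken to be 512 bytes long.

```c
#include <assert.h>
#include <stdbool.h>

enum { READ = 0, WRITE = 1 };  /* READA is left out of this sketch */

/* Reduced buffer head: flags are booleans instead of bits in b_state. */
struct buffer_head {
    unsigned long b_blocknr;   /* logical block number */
    unsigned int  b_size;      /* block size in bytes */
    unsigned long b_rsector;   /* first sector on the real device */
    bool locked, dirty, uptodate, requested;
};

enum bh_action { BH_SKIP, BH_END_IO, BH_SUBMIT };

/* Steps 3a, 3d, and 3e: decide what to do with one buffer head.
 * BH_END_IO means the caller should run b_end_io and move on. */
static enum bh_action prepare_buffer(struct buffer_head *bh, int rw)
{
    if (bh->locked)
        return BH_SKIP;              /* locked by another kernel thread */
    bh->locked = true;

    if (rw == WRITE) {
        if (!bh->dirty)
            return BH_END_IO;        /* nothing to write */
        bh->dirty = false;
    } else if (bh->uptodate) {
        return BH_END_IO;            /* never reread valid data */
    }
    return BH_SUBMIT;
}

/* Steps 1 and 2 of submit_bh( ), assuming 512-byte sectors. */
static void set_first_sector(struct buffer_head *bh)
{
    assert(bh->b_size >= 512 && bh->b_size % 512 == 0);
    bh->b_rsector = bh->b_blocknr * (bh->b_size >> 9);
    bh->requested = true;            /* stands in for the BH_Req flag */
}
```

For instance, logical block 10 with a 1,024-byte block size starts at sector 20, because each block spans two sectors.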

The generic_make_request( ) function posts the request to the low-level driver. It receives as parameters the buffer head bh and the type of operation rw (READ, WRITE, or READA), and performs the following operations:

1. Checks that bh->b_rsector does not exceed the number of sectors of the block device. If it does, prints a kernel error message, invokes the b_end_io method of the buffer head, and terminates.

2. Extracts the major number maj of the block device driver from bh->b_rdev.

3. Gets the descriptor of the device driver request queue from the low-level driver descriptor blk_dev[maj]. To do this, it invokes the blk_dev[maj].queue method, if it is defined (the driver makes use of several queues); otherwise, it reads the blk_dev[maj].request_queue field (the driver uses a single queue).

4. Invokes the make_request_fn method of the request queue descriptor identified in the previous step.

In most cases, block device drivers implement the make_request_fn method with the __make_request( ) function. It receives as parameters the queue descriptor, the buffer head bh, and the type of operation rw, and performs the following operations:

1. Checks whether the type of the operation rw is READA; in this case, it sets the local flag rw_ahead to 1 and sets rw to READ.

2. Invokes the create_bounce( ) function, which looks at the PG_highmem flag in bh->b_page->flags and determines whether the bh->b_data buffer is stored in high memory or not (see Section 7.1.6). If the buffer is in high memory, the low-level driver might not be able to handle it. Therefore, create_bounce( ) temporarily allocates a new buffer in low memory and a new buffer head pointing to it. The new buffer head is almost identical to bh, except for the b_data field, which points to the new buffer, the b_private field, which points to the original buffer head bh, and the b_end_io method, which points to a custom method that releases the low-memory buffer when the I/O operation terminates. If rw is WRITE, then the low-memory buffer is filled with the high-memory buffer contents by create_bounce( ); otherwise, if rw is READ, the low-memory buffer is copied into the high-memory buffer by the b_end_io custom method.

3. Checks whether the request queue is empty:

   ● If the request queue is empty, inserts a new request descriptor in it and schedules activation of the strategy routine of the low-level driver at a later time.

   ● If the request queue is not empty, inserts a new request descriptor in it, trying to cluster it with other requests that are already queued. As we'll see shortly, there is no need to schedule the activation of the strategy routine.

Let's look closer at these two cases.

13.4.6.1 Scheduling the activation of the strategy routine

As we saw earlier, it's expedient to delay activation of the strategy routine in order to increase the chances of clustering requests for adjacent blocks. The delay is accomplished through a technique known as device plugging and unplugging. As long as a block device driver is plugged, its strategy routine is not activated even if there are requests to be processed in the driver's queues.

If the real device's request queue is empty and the device isn't already plugged, __make_request( ) carries out a device plugging. The plug_device_fn method that performs this task is usually implemented by means of the generic_plug_device( ) function. This function sets the plugged field of the request queue descriptor to 1 and inserts the plug_tq task queue element (statically included in the request queue descriptor) in the tq_disk task queue (see Section 4.7.3.1) to cause the device's strategy routine to be activated later.

The __make_request( ) function then allocates a new request descriptor by invoking get_request( ). If no request descriptor is available, the function checks the value of the rw_ahead flag. If it is set, then the function is handling a relatively unimportant read-ahead operation, thus it invokes the b_end_io method and terminates without performing the I/O data transfer. Otherwise, the function invokes the get_request_wait( ) function to force the process to sleep until a request descriptor is freed.

Next, __make_request( ) initializes the new request descriptor with the information read from the buffer head, inserts it into the proper real device's request queue, and terminates.

How is the actual I/O data transfer started? The kernel checks periodically whether the tq_disk task queue contains any elements. This occurs in a kernel thread such as kswapd, or when the kernel must wait for some resource related to block device drivers, such as buffers or request descriptors. During the tq_disk check, the kernel removes any element in the queue and executes the corresponding function.

Usually, the function pointer stored in any plug_tq task queue element points to the generic_unplug_device( ) function, which resets the plugged field of the request queue descriptor and invokes its request_fn method, thus executing the low-level driver's strategy routine. This activity is referred to as unplugging the device. As a result, the requests included in the queues of the driver are processed, and the corresponding I/O data transfers take place.
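The plugging and unplugging sequence can be illustrated with a toy user-space model. Everything here is a simplification made for the example: the tq_disk task queue is replaced by a small array of queue pointers, requests are reduced to a counter, and plug_device( ), enqueue_request( ), run_tq_disk( ), and drain_all( ) are hypothetical names rather than kernel functions.

```c
#include <assert.h>
#include <stddef.h>

#define TQ_MAX 8

struct request_queue {
    int plugged;                                  /* 1 while plugged */
    int pending;                                  /* requests queued */
    int serviced;                                 /* requests handled */
    void (*request_fn)(struct request_queue *q);  /* strategy routine */
};

/* Toy replacement for the tq_disk task queue: an array of queues. */
static struct request_queue *tq_disk[TQ_MAX];
static int tq_len;

/* Plug the device and schedule its later unplugging. */
static void plug_device(struct request_queue *q)
{
    assert(tq_len < TQ_MAX);
    q->plugged = 1;
    tq_disk[tq_len++] = q;
}

/* Queue one request; plug first if the queue is empty and unplugged. */
static void enqueue_request(struct request_queue *q)
{
    if (q->pending == 0 && !q->plugged)
        plug_device(q);
    q->pending++;
}

/* Run the task queue: unplug each device and start its strategy routine. */
static void run_tq_disk(void)
{
    for (int i = 0; i < tq_len; i++) {
        struct request_queue *q = tq_disk[i];

        q->plugged = 0;
        q->request_fn(q);
    }
    tq_len = 0;
}

/* Trivial strategy routine: service everything that has accumulated. */
static void drain_all(struct request_queue *q)
{
    q->serviced += q->pending;
    q->pending = 0;
}
```

Three requests queued back to back trigger a single plugging; the strategy routine then sees all three at once when the task queue runs, which is what gives clustering a chance to happen.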

13.4.6.2 Extending the request queue

If the request queue is not empty, the driver was already plugged when the kernel inserted the first request in the queue. Therefore, there is no need to schedule the activation of the strategy routine again. Either the low-level driver is already unplugged, or it soon will be.

Notice that if __make_request( ) finds that the queue is not empty, the low-level driver could be actively handling the requests of the queue. Nevertheless, the function can safely modify the queue because the low-level driver usually removes the requests from the queue before processing them. As a particular case, however, the function never touches the first request in the queue when the head_active field of the request queue descriptor is set. This flag is set when the low-level driver's policy is always to process the first request in the queue and not to remove the request from the queue until the I/O data transfer completes.

The __make_request( ) function must either add a new element in the queue or include the new block in an existing request; the second case is known as block clustering. Block clustering requires that all the following conditions be satisfied:

● The block to be inserted belongs to the same block device as the other blocks in the request and is adjacent to them: it either immediately precedes the first block in the request or immediately follows the last block in the request.

● The blocks in the request have the same I/O operation type (READ or WRITE) as the block to be inserted.

● The extended request does not exceed the allowed maximum number of sectors. This value is stored in the max_sectors table, which is indexed by the major and minor number of the block device. The default value is 255 sectors.

● The extended request does not exceed the allowed maximum number of segments (see Section 13.4.5.1 earlier in this chapter), which is usually 128.

● No process is waiting for the completion of the request (that is, the waiting field of the request descriptor is NULL).
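The conditions in the list above can be gathered into a single predicate. The following user-space sketch assumes fixed limits of 255 sectors and 128 segments (the defaults quoted in the text) instead of consulting per-device tables; can_cluster( ) and the merge enumeration are hypothetical names, and the request structure is reduced to the fields that the conditions mention.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_SECTORS  255   /* default limit quoted in the text */
#define MAX_SEGMENTS 128   /* usual limit quoted in the text */

struct request {
    unsigned short rq_dev;        /* device identifier */
    int cmd;                      /* READ or WRITE */
    unsigned long sector;         /* first sector of the request */
    unsigned long nr_sectors;     /* sectors in the request */
    unsigned int nr_segments;     /* segments in the request */
    void *waiting;                /* non-NULL if a process waits on it */
};

enum merge { NO_MERGE, BACK_MERGE, FRONT_MERGE };

/* Check whether a block of count sectors starting at sector can be
 * clustered with req. new_segment is 1 if the block's buffer is not
 * contiguous in RAM with the neighboring buffer, and 0 otherwise. */
static enum merge can_cluster(const struct request *req, unsigned short dev,
                              int cmd, unsigned long sector,
                              unsigned long count, unsigned int new_segment)
{
    assert(count > 0 && new_segment <= 1);
    if (req->waiting != NULL || req->rq_dev != dev || req->cmd != cmd)
        return NO_MERGE;
    if (req->nr_sectors + count > MAX_SECTORS)
        return NO_MERGE;
    if (req->nr_segments + new_segment > MAX_SEGMENTS)
        return NO_MERGE;
    if (req->sector + req->nr_sectors == sector)
        return BACK_MERGE;        /* block follows the last block */
    if (sector + count == req->sector)
        return FRONT_MERGE;       /* block precedes the first block */
    return NO_MERGE;
}
```

The distinction between the two successful outcomes matters because a block appended to a request and a block inserted in front of it are handled by different methods of the request queue descriptor.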

When the __make_request( ) function must determine how to insert the requested block in the queue, it uses a program that is traditionally called an elevator algorithm. The elevator algorithm basically defines the ordering of the elements in the queue; usually, this ordering is also followed by the low-level driver when it is handling the requests.

Although each block device driver may define its own elevator algorithm, most block device drivers use one of the following:

ELEVATOR_NOOP algorithm
   New request descriptors are inserted at the end of the queue. Therefore, older requests precede younger ones. Notice that block clustering enlarges a request but doesn't make it younger. This algorithm favors fairness of servicing time among the various requests.

ELEVATOR_LINUS algorithm
   The queue elements tend to be ordered by the position of the corresponding sectors on the block device. This algorithm tries to minimize the number and extent of seek operations on the physical device. However, the algorithm must also rely on an ageing mechanism to keep requests in the last positions of the queue from remaining unhandled for long periods of time. When searching for a request that might include the block, the algorithm starts from the bottom of the queue and interrupts the scanning as soon as it finds a very old request.

The elevator algorithm is implemented by three methods included in the elevator field of the request queue descriptor:

elevator_merge_fn
   Scans the queue and searches for a candidate request for the block clustering. If block clustering is not possible, it returns the position where the new request should be inserted. Otherwise, it returns the request that has to be enlarged in order to include the blocks in the new request.

elevator_merge_cleanup_fn
   Invoked after a successful block clustering operation. It should increase the age of all requests in the queue that follow the enlarged request. (The method does nothing in the ELEVATOR_NOOP algorithm.)

elevator_merge_req_fn
   Invoked when the kernel merges two existing requests of the queue. It should assign the age of the new enlarged request. (The method does nothing in the ELEVATOR_NOOP algorithm.)

To add a block to the back or the front of an existing request, the __make_request( ) function uses the back_merge_fn and front_merge_fn methods of the request queue descriptor, respectively. After a successful block clustering operation, __make_request( ) also checks whether the enlarged request can be merged with the previous or the next request in the queue by invoking the merge_requests_fn method of the request queue descriptor.
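The difference between the two ordering policies can be illustrated with a deliberately simplified user-space sketch. It is not the kernel's ELEVATOR_LINUS code: the queue is a small array rather than a linked list, no clustering is attempted, and the ageing rule (every request that is overtaken loses one unit of its elevator_sequence budget, and a request with no budget left stops the scan) is an assumption chosen to mirror the description above. The names noop_insert( ), sorted_insert( ), and insert_at( ) are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define QUEUE_MAX 16

struct request {
    unsigned long sector;    /* first sector of the request */
    int elevator_sequence;   /* remaining age budget; 0 means very old */
};

struct queue {
    struct request req[QUEUE_MAX];
    int len;
};

/* Open a slot at index pos and store r there. */
static void insert_at(struct queue *q, int pos, struct request r)
{
    assert(q->len < QUEUE_MAX && pos >= 0 && pos <= q->len);
    memmove(&q->req[pos + 1], &q->req[pos],
            (size_t)(q->len - pos) * sizeof(q->req[0]));
    q->req[pos] = r;
    q->len++;
}

/* NOOP-style policy: always append, so older requests stay ahead. */
static void noop_insert(struct queue *q, struct request r)
{
    insert_at(q, q->len, r);
}

/* Sector-ordered policy with ageing: scan from the bottom of the queue
 * and move the new request ahead of requests with higher sectors.
 * Each overtaken request loses one unit of its budget, and a request
 * whose budget is exhausted stops the scan. */
static void sorted_insert(struct queue *q, struct request r)
{
    int pos = q->len;

    while (pos > 0) {
        struct request *prev = &q->req[pos - 1];

        if (prev->elevator_sequence <= 0 || prev->sector <= r.sector)
            break;
        prev->elevator_sequence--;
        pos--;
    }
    insert_at(q, pos, r);
}
```

In this model a request for sector 50 with a budget of 2 can be overtaken twice; a third request with a lower sector number is then queued behind it, which bounds how long a request near the bottom of the queue can be postponed.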

13.4.7 Low-Level Request Handling

We have now reached the lowest level in Linux's block device handling. This level is implemented by the strategy routine, which interacts with the physical block device to satisfy the requests collected in the queue.

As mentioned earlier, the strategy routine is usually started after inserting a new request in an empty request queue. Once activated, the low-level driver should handle all requests in the queue and terminate when the queue is empty.

A naïve implementation of the strategy routine could be the following: for each element in the queue, interact with the block device controller to service the request and wait until the data transfer completes. Then remove the serviced request from the queue and proceed with the next one.

Such an implementation is not very efficient. Even assuming that data can be transferred using DMA, the strategy routine must suspend itself while waiting for I/O completion; hence, an unrelated user process would be heavily penalized. (The strategy routine does not necessarily execute on behalf of the process that has requested the I/O operation but at a random, later time, since it is activated by means of the tq_disk task queue.)

Therefore, many low-level drivers adopt the following strategy:

● The strategy routine handles the first request in the queue and sets up the block device controller so that it raises an interrupt when the data transfer completes. Then the strategy routine terminates.
● When the block device controller raises the interrupt, the interrupt handler activates a bottom half. The bottom half handler removes the request from the queue and reexecutes the strategy routine to service the next request in the queue.
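The interplay between the strategy routine and the bottom half can be pictured with a small user-space model. This is a sketch only: toy_queue, toy_strategy( ), and toy_bottom_half( ) are hypothetical names, the busy flag stands in for a programmed controller, and the completion interrupt is simulated by calling the bottom half directly.

```c
#define TOY_QLEN 8

/* Hypothetical request queue: a fixed array consumed from head to tail. */
struct toy_queue {
    int reqs[TOY_QLEN];  /* request identifiers */
    int head, tail;      /* pending requests are reqs[head..tail-1] */
    int busy;            /* nonzero while the "controller" is transferring */
    int completed;       /* number of requests fully serviced */
};

/* Strategy routine: start the first pending request, then return at once. */
void toy_strategy(struct toy_queue *q)
{
    if (q->busy || q->head == q->tail)
        return;          /* transfer in progress, or nothing to do */
    q->busy = 1;         /* stands for programming the controller */
}

/* Bottom half run after the completion interrupt. */
void toy_bottom_half(struct toy_queue *q)
{
    if (!q->busy)
        return;          /* spurious interrupt: ignore it */
    q->head++;           /* remove the serviced request from the queue */
    q->completed++;
    q->busy = 0;
    toy_strategy(q);     /* service the next request, if any */
}
```

The strategy routine never waits: each call either starts one transfer or returns immediately, and the queue drains one request per simulated interrupt.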

Basically, low-level drivers can be further classified into the following:

● Drivers that service each block in a request separately
● Drivers that service several blocks in a request together

Drivers of the second type are much more complicated to design and implement than drivers of the first type. Indeed, although the sectors are adjacent on the physical block devices, the buffers in RAM are not necessarily consecutive. Therefore, any such driver may have to allocate a temporary area for the DMA data transfer, and then perform a memory-to-memory copy of the data between the temporary area and each buffer in the request's list.

Since requests handled by both types of drivers consist of adjacent blocks, disk performance is enhanced because fewer seek commands are issued. However, drivers of the second type do not further reduce the number of seek commands, so transferring several blocks from disk together is not as effective in boosting disk performance.

The kernel doesn't offer any support for the second type of drivers: they must handle the request queues and the buffer head lists on their own. The choice to leave the job up to the driver is not capricious or lazy. Each physical block device is inherently different from all others (for example, a floppy driver groups blocks in disk tracks and transfers a whole track in a single I/O operation), so making general assumptions on how to service each clustered request makes very little sense.

However, the kernel offers a limited degree of support for the low-level drivers in the first class. We'll spend a little more time describing such drivers.

A typical strategy routine should perform the following actions:

1. Get the current request from a request queue. If all request queues are empty, terminate the routine.
2. Check that the current request has consistent information. In particular, compare the major number of the block device with the value stored in the rq_rdev field of the request descriptor. Moreover, check that the first buffer head in the list is locked (the BH_Lock flag should have been set by ll_rw_block( )).

3. Program the block device controller for the data transfer of the first block. The data transfer direction can be found in the cmd field of the request descriptor and the address of the buffer in the buffer field, while the initial sector number and the number of sectors to be transferred are stored in the sector and current_nr_sectors fields, respectively. [8] Also, set up the block device controller so that an interrupt is raised when the DMA data transfer completes.

   [8] Recall that current_nr_sectors contains the number of sectors in the first block of the request, while nr_sectors contains the total number of sectors in the request.

4. If the routine is handling a block device file for which ll_rw_block( ) accomplishes block clustering, increment the sector field and decrement the nr_sectors field of the request descriptor to keep track of the blocks to be transferred.

The interrupt handler associated with the termination of the DMA data transfer for the block device should invoke (either directly or via a bottom half) the end_request function (or a custom function of the block device driver that does the same things). The function, which receives as its parameter the value 1 if the data transfer succeeds or the value 0 if an error occurs, performs the following operations:

1. If an error occurred (parameter value is 0), updates the sector and nr_sectors fields so as to skip the remaining sectors of the block. In Step 3a, the buffer content is also marked as not up-to-date.
2. Removes the buffer head of the transferred block from the request's list.
3. Invokes the b_end_io method of the buffer head. When the ll_rw_block( ) function allocates the buffer head, it loads this field with the address of the end_buffer_io_sync( ) function, which essentially performs two operations:

   a. Sets the BH_Uptodate flag of the buffer head to 1 or 0, according to the success or failure of the data transfer
   b. Clears the BH_Lock, BH_Wait_IO, and BH_launder flags of the buffer head and wakes up all processes sleeping in the wait queue to which the b_wait field of the buffer head points

   The b_end_io field could also point to other functions. For instance, if the create_bounce( ) function created a temporary buffer in low memory, the b_end_io field points to a suitable function that updates the original buffer in high memory and then invokes the b_end_io method of the original buffer head.
4. If there is another buffer head on the request's list, sets the current_nr_sectors field of the request descriptor to the number of sectors of the new block.
5. Sets the buffer field with the address of the new buffer (from the b_data field of the new buffer head).
6. Otherwise, if the request's list is empty, all blocks have been processed. Therefore, it performs the following operations:
   a. Removes the request descriptor from the request queue
   b. Wakes up any process waiting for the request to complete (waiting field in the request descriptor)
   c. Sets the rq_status field of the request to RQ_INACTIVE
   d. Puts the request descriptor in the list of free requests

After invoking end_request, the low-level driver checks whether the request queue is empty. If it is not, the strategy routine is executed again. Notice that end_request actually performs two nested iterations: the outer one on the elements of the request queue and the inner one on the elements in the buffer head list of each request. The strategy routine is thus invoked once for each block in the request queue.
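The per-block bookkeeping performed by end_request can be sketched in user space as follows. The structures toy_bh and toy_req and the functions toy_end_io( ) and toy_end_request( ) are hypothetical stand-ins that borrow the field names used above; wait queues, the free request list, and locking are left out, and a sector is assumed to be 512 bytes long.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the buffer head and the request descriptor. */
struct toy_bh {
    char *b_data;              /* buffer address */
    unsigned int b_size;       /* block size in bytes */
    int uptodate;              /* models BH_Uptodate */
    int locked;                /* models BH_Lock */
    struct toy_bh *b_reqnext;  /* next buffer head in the request's list */
};

struct toy_req {
    unsigned long sector;             /* next sector to transfer */
    unsigned long nr_sectors;         /* sectors left in the whole request */
    unsigned long current_nr_sectors; /* sectors in the first block */
    char *buffer;                     /* address of the first block's buffer */
    struct toy_bh *bh;                /* list of buffer heads */
};

/* Models end_buffer_io_sync(): record the outcome and unlock the buffer. */
static void toy_end_io(struct toy_bh *bh, int uptodate)
{
    bh->uptodate = uptodate;
    bh->locked = 0;
}

/*
 * Complete the first block of req. Returns 1 when every block has been
 * processed (the caller would then dequeue and free the request) and 0
 * when another block is pending. On success the strategy routine is
 * assumed to have already advanced sector and nr_sectors.
 */
int toy_end_request(struct toy_req *req, int uptodate)
{
    struct toy_bh *bh = req->bh;

    if (!uptodate) {                 /* error: skip the rest of this block */
        req->sector += req->current_nr_sectors;
        req->nr_sectors -= req->current_nr_sectors;
    }
    req->bh = bh->b_reqnext;         /* unlink the transferred block */
    bh->b_reqnext = NULL;
    toy_end_io(bh, uptodate);

    if (req->bh == NULL)
        return 1;                    /* the request's list is empty */
    req->current_nr_sectors = req->bh->b_size >> 9;  /* 512-byte sectors */
    req->buffer = req->bh->b_data;
    return 0;
}
```

A driver built on this model would call toy_end_request( ) once per completed block and rerun its strategy routine after each call, which mirrors the nested iteration noted above.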

13.4.8 Block and Page I/O Operations

We'll discuss in the forthcoming chapters how the kernel uses the block device drivers. We'll see that there are a number of cases in which the kernel activates disk I/O data transfers. However, let's describe here the two fundamental kinds of I/O data transfer for block devices:

Block I/O operations
Here the I/O operation transfers a single block of data, so the transferred data can be kept in a single RAM buffer. The disk address consists of a device number and a block number. The buffer is associated with a specific disk block, which is identified by the major and minor numbers of the block device and by the logical block number.

Page I/O operations
Here the I/O operation transfers as many blocks of data as needed to fill a single page frame (the exact number depends both on the disk block size and on the page frame size). If the size of a page frame is a multiple of the block size, several disk blocks are transferred in a single I/O operation. Each page frame contains data belonging to a file. Since this data is not necessarily stored in adjacent disk blocks, it is identified by the file's inode and by an offset within the file.

Block I/O operations are most often used when the kernel reads or writes single blocks in a filesystem (for example, a block containing an inode or a superblock). Conversely, page I/O operations are used mainly for reading and writing files (both regular files and block device files), for accessing files through the memory mapping, and for swapping.

Both kinds of I/O operations rely on the same functions to access a block device, but the kernel uses different algorithms and buffering techniques with them.

13.4.8.1 Block I/O operations

The bread( ) function reads a single block from a block device and stores it in a buffer. It receives as parameters the device identifier, the block number, and the block size, and returns a pointer to the buffer head of the buffer containing the block.

The function performs the following operations:

1. Invokes the getblk( ) function to search for the block in a software cache called the buffer cache (see Section 14.2.4). If the block is not included in the cache, getblk( ) allocates a new buffer for it.
2. Invokes mark_page_accessed( ) on the buffer page containing the data (see Section 16.7.2).
3. If the buffer already contains valid data, it terminates.
4. Invokes ll_rw_block( ) to start the READ operation (see Section 13.4.6 earlier in this chapter).
5. Waits until the data transfer completes. This is done by invoking a function named wait_on_buffer( ), which inserts the current process in the b_wait wait queue and suspends the process until the buffer is unlocked.
6. Checks whether the buffer contains valid data. If so, it returns the address of the buffer head; otherwise, it returns a NULL pointer.

No function exists to directly write a block to disk. Declaring a buffer dirty is sufficient to force its flushing to disk at some later time. In fact, write operations are not considered critical for system performance, so they are deferred whenever possible (see Section 14.2.4).
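The control flow of bread( ) can be illustrated with a toy buffer cache placed in front of a simulated disk. Everything in this sketch is an assumption made for illustration: toy_buf, toy_getblk( ), toy_bread( ), and the in-memory toy_disk array are hypothetical, the transfer is performed synchronously by memcpy( ) instead of going through ll_rw_block( ) and wait_on_buffer( ), and the cache never evicts buffers.

```c
#include <stddef.h>
#include <string.h>

#define TOY_BLOCK_SIZE 16
#define TOY_NBLOCKS     8   /* size of the simulated disk, in blocks */
#define TOY_NBUFS       4   /* size of the simulated buffer cache */

struct toy_buf {
    int in_use;
    int dev, block;
    int uptodate;                  /* models BH_Uptodate */
    char data[TOY_BLOCK_SIZE];
};

static char toy_disk[TOY_NBLOCKS][TOY_BLOCK_SIZE];
static struct toy_buf toy_cache[TOY_NBUFS];
static int toy_disk_reads;         /* counts simulated device accesses */

/* Models getblk(): find the block in the cache or claim a free buffer. */
static struct toy_buf *toy_getblk(int dev, int block)
{
    struct toy_buf *free_buf = NULL;

    for (int i = 0; i < TOY_NBUFS; i++) {
        struct toy_buf *b = &toy_cache[i];
        if (b->in_use && b->dev == dev && b->block == block)
            return b;
        if (!b->in_use && free_buf == NULL)
            free_buf = b;
    }
    if (free_buf != NULL) {
        free_buf->in_use = 1;
        free_buf->dev = dev;
        free_buf->block = block;
        free_buf->uptodate = 0;
    }
    return free_buf;   /* NULL if the toy cache is full (no eviction here) */
}

/* Models bread(): return a buffer holding valid data, or NULL on failure. */
struct toy_buf *toy_bread(int dev, int block)
{
    struct toy_buf *b = toy_getblk(dev, block);

    if (b == NULL)
        return NULL;
    if (b->uptodate)
        return b;                              /* cache hit: no disk access */
    if (block >= 0 && block < TOY_NBLOCKS) {   /* synchronous "transfer" */
        memcpy(b->data, toy_disk[block], TOY_BLOCK_SIZE);
        b->uptodate = 1;
        toy_disk_reads++;
    }
    return b->uptodate ? b : NULL;             /* NULL models an I/O error */
}
```

Reading the same block twice touches the simulated disk only once, which is the whole point of searching the cache before starting the transfer.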

13.4.8.2 Page I/O operations

Block devices transfer information one block at a time, while process address spaces (or to be more precise, memory regions allocated to the process) are defined as sets of pages. This mismatch can be hidden to some extent by using page I/O operations. They may be activated in the following cases:

● A process issues a read( ) or write( ) system call on a file (see Chapter 15).
● A process reads a location of a page that maps a file in memory (see Chapter 15).
● The kernel flushes some dirty pages related to a file memory mapping to disk (see Section 15.2.5).
● When swapping in or swapping out, the kernel loads from or saves to disk the contents of whole page frames (see Chapter 16).

Page I/O operations can be activated by several kernel functions. In this section, we'll present the brw_page( ) function used to read or write swap pages (see Chapter 16). Other functions that start page I/O operations are discussed in Chapter 15.

The brw_page( ) function receives the following parameters:

rw
Type of I/O operation (READ, WRITE, or READA)

page
Address of a page descriptor

dev
Block device number (major and minor numbers)

b
Array of logical block numbers

size
Block size

The page descriptor refers to the page involved in the page I/O operation. It must already be locked (PG_locked flag on) before invoking brw_page( ) so that no other kernel control path can access it. The page is considered as split into 4096/size buffers; the ith buffer in the page is associated with the block b[i] of device dev.

The function performs the following operations:

1. Checks the page->buffers field; if it is NULL, invokes create_empty_buffers( ) to allocate temporary buffer heads for all buffers included in the page (such buffer heads are called asynchronous; they are discussed in Section 14.2.1). The address of the buffer head for the first buffer in the page is stored in the page->buffers field. The b_this_page field of each buffer head points to the buffer head of the next buffer in the page.
   Conversely, if the page->buffers field is not NULL, the kernel does not need to allocate temporary buffer heads. In fact, in this case, the page stores some buffers already included in the buffer cache, presumably because some of them were previously involved in block I/O operations (see Section 14.2.2 for further details).
2. For each buffer head in the page, performs the following substeps:
   a. Sets the BH_Lock (locks the buffer for the I/O data transfer) and the BH_Mapped (the buffer maps a file on disk) flags of the buffer head.
   b. Stores in the b_blocknr field the value of the corresponding element of the array b.
   c. Since it is an asynchronous buffer head, sets the BH_Async flag, and in the b_end_io field, stores a pointer to end_buffer_io_async( ) (described next).
3. For each buffer head in the page, invokes submit_bh( ) to request the buffer (see Section 13.4.6 earlier in this chapter).

The submit_bh( ) function activates the device driver of the block device being accessed. As described in Section 13.4.7 earlier in this chapter, the device driver performs the actual data transfer and then invokes the b_end_io method of all asynchronous buffer heads that have been transferred. The b_end_io field points to the end_buffer_io_async( ) function, which performs the following operations:

1. Sets the BH_Uptodate flag of the asynchronous buffer head according to the result of the I/O operation.
2. If the BH_Uptodate flag is off, sets the PG_error flag of the page descriptor because an error occurred while transferring the block. The function gets the page descriptor address from the b_page field of the buffer head.

3. Gets the page_update_lock spin lock.
4. Clears both the BH_Async and the BH_Lock flags of the buffer head, and awakens each process waiting for the buffer.
5. If any of the buffer heads in the page are still locked (i.e., the I/O data transfer is not yet terminated), releases the page_update_lock spin lock and returns.
6. Otherwise, releases the page_update_lock spin lock and checks the PG_error flag of the page descriptor. If it is cleared, then all data transfers on the page have successfully completed, so the function sets the PG_uptodate flag of the page descriptor.
7. Unlocks the page, clears the PG_locked flag, and wakes any process sleeping on the page->wait wait queue.

Notice that once the page I/O operation terminates, the temporary buffer heads allocated by create_empty_buffers( ) are not automatically released. As we shall see in Chapter 16, the temporary buffer heads are released only when the kernel tries to reclaim some memory.
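The way a page is split into buffers, and the rule that the last buffer to complete unlocks the page, can be modelled as shown below. This is a single-threaded sketch with hypothetical names (toy_page, toy_abh, toy_brw_page( ), toy_end_io_async( )); the page_update_lock spin lock and all wait queues are omitted, and the buffer heads are embedded in the page instead of being allocated separately.

```c
#define TOY_PAGE_SIZE 4096
#define TOY_MAX_BUFS  8     /* 4096 / 512, the smallest block size assumed */

struct toy_page;

struct toy_abh {                /* models an asynchronous buffer head */
    int locked;                 /* BH_Lock */
    int uptodate;               /* BH_Uptodate */
    unsigned long blocknr;      /* b_blocknr */
    struct toy_page *page;      /* b_page */
};

struct toy_page {
    int locked, uptodate, error;    /* PG_locked, PG_uptodate, PG_error */
    int nbufs;
    struct toy_abh bufs[TOY_MAX_BUFS];
};

/* Models the setup done by brw_page(): one locked buffer per block. */
int toy_brw_page(struct toy_page *pg, const unsigned long *b, int size)
{
    int n;

    if (size <= 0 || !pg->locked)
        return -1;              /* the caller must have locked the page */
    n = TOY_PAGE_SIZE / size;
    if (n < 1 || n > TOY_MAX_BUFS)
        return -1;
    pg->nbufs = n;
    for (int i = 0; i < n; i++) {
        pg->bufs[i].locked = 1;
        pg->bufs[i].uptodate = 0;
        pg->bufs[i].blocknr = b[i];   /* ith buffer <-> block b[i] */
        pg->bufs[i].page = pg;
    }
    return n;
}

/* Models end_buffer_io_async(): the last buffer to finish unlocks the page. */
void toy_end_io_async(struct toy_abh *bh, int uptodate)
{
    struct toy_page *pg = bh->page;

    bh->uptodate = uptodate;
    if (!uptodate)
        pg->error = 1;
    bh->locked = 0;
    for (int i = 0; i < pg->nbufs; i++)
        if (pg->bufs[i].locked)
            return;             /* some transfer is still in flight */
    if (!pg->error)
        pg->uptodate = 1;
    pg->locked = 0;
}
```

With a block size of 1,024 bytes the page holds four buffers, and the page stays locked until the fourth completion arrives.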


13.5 Character Device Drivers

Handling a character device is relatively easy, since usually sophisticated buffering strategies are not needed and disk caches are not involved. Of course, character devices differ in their requirements: some of them must implement a sophisticated communication protocol to drive the hardware device, while others just have to read a few values from a couple of I/O ports of the hardware devices. For instance, the device driver of a multiport serial card device (a hardware device offering many serial ports) is much more complicated than the device driver of a bus mouse.

A small complication, however, comes from the fact that the same major number might be allocated to several different device drivers. For instance, the major number 10 is used by many different device drivers, such as a real-time clock and a PS/2 mouse.

To keep track of which character device drivers are currently in use, the kernel uses a hash table indexed by the major and minor numbers. [9] The hash table array is stored in the cdev_hashtable variable; it includes 64 lists of character device descriptors. Each descriptor is a char_device data structure, whose fields are shown in Table 13-10.

[9] A character device driver registered with the devfs device file might not have major and minor numbers. In this case, the kernel assumes that its major and minor numbers are equal to zero.

Table 13-10. The fields of the character device descriptor

Type                Field     Description
struct list_head    hash      Pointers for the hash table list
atomic_t            count     Usage counter for the character device descriptor
dev_t               dev       Major and minor numbers of the character device
atomic_t            openers   Not used
struct semaphore    sem       Semaphore protecting the character device

As for block device drivers, a hash table is required because the kernel cannot determine whether a character device driver is in use by simply checking whether a character device file has been already opened. In fact, the system directory tree might include several character device files having different pathnames but equal major and minor numbers, and they all refer to the very same device driver.

A character device descriptor is inserted into the hash table whenever a device file referring to it is opened for the first time. This job is performed by the init_special_inode( ) function, which is invoked by the low-level filesystem layer when it determines that a disk inode represents a device file. init_special_inode( ) looks up the character device descriptor in the hash table; if the descriptor is not found, the function allocates a new descriptor and inserts that into the hash table. The function also stores the descriptor address into the i_cdev field of the inode object of the device file.
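The look-up-or-allocate step can be sketched as a small chained hash table. The code below is illustrative only: toy_cdev, toy_cdget( ), and in particular the hash function toy_cdev_hash( ) are hypothetical, the device number is assumed to pack the major number in the upper bits and the minor number in the low 8 bits, and singly linked chains replace the list_head fields of the real descriptor.

```c
#include <stdlib.h>

#define TOY_CDEV_HASH 64    /* number of lists, as stated in the text */

struct toy_cdev {
    unsigned int dev;           /* major and minor numbers packed together */
    int count;                  /* usage counter */
    struct toy_cdev *next;      /* hash chain (the kernel uses list_head) */
};

static struct toy_cdev *toy_cdev_table[TOY_CDEV_HASH];

/* Hypothetical hash function; not taken from the kernel. */
static unsigned int toy_cdev_hash(unsigned int dev)
{
    return (dev ^ (dev >> 8)) % TOY_CDEV_HASH;
}

/* Look up the descriptor for dev, allocating and inserting it on first use. */
struct toy_cdev *toy_cdget(unsigned int dev)
{
    unsigned int h = toy_cdev_hash(dev);
    struct toy_cdev *c;

    for (c = toy_cdev_table[h]; c != NULL; c = c->next)
        if (c->dev == dev) {
            c->count++;         /* another device file refers to it */
            return c;
        }
    c = malloc(sizeof(*c));
    if (c == NULL)
        return NULL;
    c->dev = dev;
    c->count = 1;
    c->next = toy_cdev_table[h];
    toy_cdev_table[h] = c;
    return c;
}
```

Two device files with different pathnames but the same numbers obtain the same descriptor, so the usage counter reflects the driver rather than any single file.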

We mentioned in Section 13.2.3 that the dentry_open( ) function triggered by the open( ) system call service routine customizes the f_op field in the file object of the character device file so that it points to the def_chr_fops table. This table is almost empty; it only defines the chrdev_open( ) function as the open method of the device file. This method is immediately invoked by dentry_open( ).

The chrdev_open( ) function rewrites the f_op field of the file object with the address stored in the chrdevs table element that corresponds to the major number of the character device file. Then the function invokes the open method again.

If the major number is assigned to a unique device driver, the method initializes the device driver. Otherwise, if the major number is shared among several device drivers, the method rewrites once more the f_op field of the file object with an address found in the data structure indexed by the minor number of the device file. For instance, the file_operations data structures for the device file that has the major number 10 are stored in the singly linked list misc_list. Finally, the open method is invoked for the last time to initialize the device driver.

Once opened, the character device file usually can be accessed for reading and/or for writing; to do this, the read and write methods of the file object point to suitable functions of the device driver. Most device drivers also support the ioctl( ) system call through the ioctl file object method; it allows special commands to be sent to the underlying hardware device.
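The repeated rewriting of the f_op field can be followed in a compact model built on function pointers. All names here (toy_file, toy_fops, toy_chrdev_open( ), toy_misc_open( ), toy_setup( )) are hypothetical, the table sizes are arbitrary, and the per-minor table for major number 10 is an array rather than the linked list used by the kernel.

```c
#include <stddef.h>

#define TOY_MAX_MAJOR  256
#define TOY_MAX_MINOR  256
#define TOY_MISC_MAJOR 10

struct toy_file;

struct toy_fops {
    int (*open)(struct toy_file *f);
};

struct toy_file {
    unsigned int major, minor;
    const struct toy_fops *f_op;
    int opened_by;              /* records which driver finally ran */
};

static const struct toy_fops *toy_chrdevs[TOY_MAX_MAJOR];  /* by major */
static const struct toy_fops *toy_misc[TOY_MAX_MINOR];     /* by minor */

/* Driver-specific open method for one device sharing major number 10. */
static int toy_mouse_open(struct toy_file *f)
{
    f->opened_by = 1;
    return 0;
}
static const struct toy_fops toy_mouse_fops = { toy_mouse_open };

/* Open method for the shared major: dispatch again on the minor number. */
static int toy_misc_open(struct toy_file *f)
{
    if (f->minor >= TOY_MAX_MINOR || toy_misc[f->minor] == NULL)
        return -1;
    f->f_op = toy_misc[f->minor];
    return f->f_op->open(f);
}
static const struct toy_fops toy_misc_fops = { toy_misc_open };

/* Register the sample drivers (real drivers do this at initialization). */
void toy_setup(void)
{
    toy_chrdevs[TOY_MISC_MAJOR] = &toy_misc_fops;
    toy_misc[1] = &toy_mouse_fops;   /* minor 1 chosen arbitrarily */
}

/* Models chrdev_open(): replace f_op using the major number, then reopen. */
int toy_chrdev_open(struct toy_file *f)
{
    if (f->major >= TOY_MAX_MAJOR || toy_chrdevs[f->major] == NULL)
        return -1;
    f->f_op = toy_chrdevs[f->major];
    return f->f_op->open(f);
}
```

After toy_setup( ), opening a file with major number 10 and minor number 1 leaves f_op pointing at the mouse driver's table, having passed through the shared table on the way.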


Chapter 14. Disk Caches

This chapter deals with disk caches. It shows how Linux uses sophisticated techniques to improve system performance by reducing disk accesses as much as possible.

As mentioned in Section 12.1.1, a disk cache is a software mechanism that allows the system to keep in RAM some data that is normally stored on a disk, so that further accesses to that data can be satisfied quickly without accessing the disk.

Besides the dentry cache, which is used by the VFS to speed up the translation of a file pathname to the corresponding inode, two main disk caches (the buffer cache and the page cache) are used by Linux.

As suggested by its name, the buffer cache is a disk cache consisting of buffers; as we know from Section 13.4.3, each buffer stores a single disk block. The block I/O operations (described in Section 13.4.8.1 in the same chapter) rely on the buffer cache to reduce the number of disk accesses.

Conversely, the page cache is a disk cache consisting of pages; each page in the cache corresponds to several blocks of a regular file or a block device file. Of course, the exact number of blocks contained in a page depends on the size of the block. All such blocks are logically contiguous; that is, they represent an integral portion of a regular file or of a block device file.

To reduce the number of disk accesses, before activating a page I/O operation (described in Section 13.4.8.2), the kernel should check whether the wanted data is already stored in the page cache.

Table 14-1 shows how some widely used I/O operations use the buffer and page caches. Some of the examples given refer to the Ext2 filesystem, but they can apply to almost all disk-based filesystems.

Table 14-1. Use of the buffer cache and page cache

Kernel function          System call    Cache    I/O operation
bread( )                 None           Buffer   Read an Ext2 superblock
bread( )                 None           Buffer   Read an Ext2 inode
generic_file_read( )     getdents( )    Page     Read an Ext2 directory
generic_file_read( )     read( )        Page     Read an Ext2 regular file
generic_file_write( )    write( )       Page     Write an Ext2 regular file
generic_file_read( )     read( )        Page     Read a block device file
generic_file_write( )    write( )       Page     Write a block device file
filemap_nopage( )        None           Page     Access a memory-mapped file
brw_page( )              None           Page     Access to swapped-out page

Each operation in this table appears in subsequent chapters:

Read an Ext2 superblock
See Section 13.4.8.1. See also Chapter 17.

Read an Ext2 inode
See Section 15.1.3. See also Chapter 17 for the Ext2 filesystem.

Read an Ext2 directory
See Section 15.1.1. See also Chapter 17 for the Ext2 filesystem.

Read an Ext2 regular file
See Section 15.1.1.

Write an Ext2 regular file
See Section 15.1.3. See also Chapter 17 for the Ext2 filesystem.

Read a block device file
See Section 15.1.1.

Write a block device file
See Section 15.1.3.

Access a memory-mapped file
See Section 15.2.

Access to swapped-out page
See Section 13.4.8.2. See also Chapter 16.

For each type of I/O activity, the table also shows the system call required to start it (if any) and the main corresponding kernel function that handles it.

The table shows that accesses to memory-mapped files and swapped-out pages do not require system calls; they are transparent to the programmer. Once a file memory mapping is set up and swapping is activated, the application program can access the mapped file or the swapped-out page as if it were present in memory. It is the kernel's responsibility to delay the process until the required page is located on disk and brought into RAM.

You'll also notice that the same kernel function, namely generic_file_read( ), is used to read from block device files and from regular files. Similarly, generic_file_write( ) is used to write both into block device files and into regular files. We describe these functions in Chapter 15.


14.1 The Page Cache

To avoid unnecessary disk accesses, the kernel never tries to read a page from disk without looking into the page cache and verifying that it does not already include the requested data. To take maximum advantage of the page cache, searching it should be a very fast operation.

The unit of information kept in the page cache is, of course, a whole page of data. A page does not necessarily contain physically adjacent disk blocks, so it cannot be identified by a device number and a block number. Instead, a page in the page cache is identified by the address of a special data structure named address_space, and by an offset within the file (or whatever) referenced by the address_space data structure.

14.1.1 The address_space Object

Table 14-1 suggests that the Linux page cache speeds up several different kinds of I/O operations. In fact, the page cache may include the following types of pages:

● Pages containing data of regular files and directories of disk-based filesystems; in Chapter 15, we describe how the kernel handles read and write operations on them.

● Pages containing data of a memory-mapped file; see Chapter 15 for details.

● Pages containing data directly read from block device files (skipping the filesystem layer); as discussed in Chapter 15, the kernel handles them using the same set of functions as for pages containing data of regular files.

● Pages containing data of User Mode processes that have been swapped out on disk. As we shall see in Chapter 16, the kernel could be forced to keep in the page cache some pages whose contents have been already written on a swap area.

● Pages belonging to an Interprocess Communication (IPC) shared memory region; we describe IPC resources in Chapter 19.

So far, so good, but how is the kernel supposed to keep track of how every page in the page cache should be handled? For instance, suppose the kernel wishes to update the content of a page included in the page cache. Reading the page contents from a regular file, from a directory, from a block device file, or from a swap area are quite different operations, and the kernel must execute the proper operation according to the type of page.

The key data structure that establishes the relationship between pages and the methods that operate on the pages is the address_space object. Formally, each address_space object establishes a link between a generic kernel object (the so-called owner) and a set of methods that operate on the pages belonging to the owner.

As stated before, the page cache includes five kinds of pages, so a page may belong to five possible kinds of owners. For instance, if a page belongs to a regular file that is stored in an Ext2 filesystem, the owner of the page is an inode object. The i_mapping field of this object points to an address_space object. In turn, the address_space object defines a set of methods that allow the kernel to act on the pages containing the data of our regular file.

Specifically, the address_space object includes the fields shown in Table 14-2.

Table 14-2. The fields of the address_space object

Type                                 Field           Description
struct list_head                     clean_pages     List of owner's clean pages
struct list_head                     dirty_pages     List of owner's nonlocked dirty pages
struct list_head                     locked_pages    List of owner's locked dirty pages
unsigned long                        nrpages         Total number of owner's pages
struct address_space_operations *    a_ops           Methods that operate on the owner's pages
struct inode *                       host            Pointer to the owning inode
struct vm_area_struct *              i_mmap          List of memory regions for private memory mapping
struct vm_area_struct *              i_mmap_shared   List of memory regions for shared memory mapping
spinlock_t                           i_shared_lock   Spin lock for the lists of memory regions
int                                  gfp_mask        Memory allocator flags for the owner's pages

The clean_pages, dirty_pages, and locked_pages fields represent the heads of three lists of page descriptors. Together, these lists include all pages that belong to the owner of the address_space object. We discuss the role of each list in the next section. The nrpages field stores the total number of pages inserted in the three lists.

Although the owner of the address_space object could be any generic kernel object, usually it is a VFS inode object. (After all, the page cache was introduced to speed up disk accesses!) In this case, the host field points to the inode that owns the address_space object.

The i_mmap, i_mmap_shared, i_shared_lock, and gfp_mask fields are used whenever the owner of the address_space object is an inode of a memory-mapped file. We discuss them in Section 15.2.1.

The most important field of the address_space object is a_ops, which points to a table of type address_space_operations containing the methods that define how the owner's pages are handled. These methods are shown in Table 14-3.

Table 14-3. The methods of the address_space object

Method          Description
writepage       Write operation (from the page to the owner's disk image)
readpage        Read operation (from the owner's disk image to the page)
sync_page       Start the I/O data transfer of already scheduled operations on the page
prepare_write   Prepare the write operation (used by disk-based filesystems)
commit_write    Complete the write operation (used by disk-based filesystems)
bmap            Get a logical block number from a file block index
flushpage       Prepare to delete the page from the owner's disk image
releasepage     Used by journaling filesystems to prepare the release of a page
direct_IO       Direct I/O transfer of the data of the page

The most important methods are readpage, writepage, prepare_write, and commit_write. We discuss them in Chapter 15. In most cases, the methods link the owner inode objects with the low-level drivers that access the physical devices. For instance, the function that implements the readpage method for an inode of a regular file "knows" how to locate the positions on the physical disk device of the blocks corresponding to any page of the file. In this chapter, however, we don't have to discuss the address_space methods further.
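The link between an owner and its page-handling methods can be pictured as a table of function pointers. The following is only a minimal user-space sketch of that idea, not kernel code: every name prefixed with sketch_ or ramfile_ is hypothetical, the method signatures are simplified (the real methods take different parameters), and the "disk image" is just an array in memory.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE  16   /* tiny "page" so the example stays readable */
#define SKETCH_FILE_PAGES 4    /* size of the hypothetical owner, in pages */

struct sketch_page;
struct sketch_address_space;

/* Simplified stand-in for address_space_operations: two methods only. */
struct sketch_aops {
    int (*readpage)(struct sketch_address_space *mapping, struct sketch_page *page);
    int (*writepage)(struct sketch_address_space *mapping, struct sketch_page *page);
};

struct sketch_address_space {
    const struct sketch_aops *a_ops;  /* methods for the owner's pages */
    void *host;                       /* the owner (usually an inode in the kernel) */
    unsigned long nrpages;
};

struct sketch_page {
    struct sketch_address_space *mapping;  /* which owner this page belongs to */
    unsigned long index;                   /* position in page-size units */
    char data[SKETCH_PAGE_SIZE];
};

/* Hypothetical owner: a "file" whose disk image is a byte array. */
struct sketch_ramfile {
    char bytes[SKETCH_FILE_PAGES * SKETCH_PAGE_SIZE];
};

static int ramfile_readpage(struct sketch_address_space *mapping, struct sketch_page *page)
{
    struct sketch_ramfile *f = mapping->host;

    if (page->index >= SKETCH_FILE_PAGES)
        return -1;
    memcpy(page->data, f->bytes + page->index * SKETCH_PAGE_SIZE, SKETCH_PAGE_SIZE);
    return 0;
}

static int ramfile_writepage(struct sketch_address_space *mapping, struct sketch_page *page)
{
    struct sketch_ramfile *f = mapping->host;

    if (page->index >= SKETCH_FILE_PAGES)
        return -1;
    memcpy(f->bytes + page->index * SKETCH_PAGE_SIZE, page->data, SKETCH_PAGE_SIZE);
    return 0;
}

static const struct sketch_aops ramfile_aops = {
    .readpage  = ramfile_readpage,
    .writepage = ramfile_writepage,
};

/* Generic code: it follows page->mapping->a_ops and never needs to know
 * what kind of owner stands behind the page. */
static int sketch_fill_page(struct sketch_page *page)
{
    return page->mapping->a_ops->readpage(page->mapping, page);
}

static int sketch_flush_page(struct sketch_page *page)
{
    return page->mapping->a_ops->writepage(page->mapping, page);
}
```

A different kind of owner (a swap area, say) would supply its own method table, and sketch_fill_page( ) would work on its pages unchanged; that is the point of the indirection.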

14.1.2 Page Cache Data Structures

The page cache uses the following main data structures:

A page hash table
    This lets the kernel quickly derive the page descriptor address for the page associated with a specified address_space object and a specified offset (presumably, a file offset).

Page descriptor lists in the address_space object
    This lets the kernel quickly retrieve all pages in a given state owned by a particular inode object (or other kernel object) referenced by an address_space object.

Manipulation of the page cache involves adding and removing entries from these data structures, as well as updating the fields in all objects that reference the cached pages. The pagecache_lock spin lock protects the page cache data structures against concurrent accesses in multiprocessor systems.

14.1.2.1 The page hash table

When a process reads a large file, the page cache may become filled with pages related to that file. In such cases, scanning a long list of page descriptors to find the page that maps the required file portion could become a time-consuming operation.

For this reason, Linux uses a hash table of page descriptor pointers named page_hash_table. Its size depends on the amount of available RAM; for example, for systems having 128 MB of RAM, page_hash_table is stored in 32 page frames and includes 32,768 page descriptor pointers.

The page_hash macro uses the address of an address_space object and an offset value to derive the address of the corresponding entry in the hash table. As usual, chaining is introduced to handle entries that cause a collision: the next_hash and pprev_hash fields of the page descriptors are used to implement doubly linked lists of entries that have the same hash value. The page_cache_size variable specifies the number of page descriptors included in the collision lists of the page hash table (and therefore in the whole page cache).

The add_page_to_hash_queue( ) and remove_page_from_hash_queue( ) functions add an element into the hash table and remove an element from it, respectively.
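The sizing figure quoted above is consistent with 4 KB page frames and 4-byte pointers: 32 frames × 4,096 bytes ÷ 4 bytes per pointer = 32,768 entries. The chaining scheme itself can be sketched as follows. This is an illustrative user-space model under stated assumptions: the sketch_ names are hypothetical, the hash formula is a placeholder rather than the one used by the page_hash macro, chains end with a NULL pointer, and no locking is shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_HASH_BITS 4
#define SKETCH_HASH_SIZE (1UL << SKETCH_HASH_BITS)

struct sketch_page {
    const void *mapping;              /* address of the owning address_space */
    unsigned long index;              /* offset in page-size units */
    struct sketch_page *next_hash;    /* next colliding descriptor */
    struct sketch_page **pprev_hash;  /* address of the pointer that points to us */
};

static struct sketch_page *sketch_hash_table[SKETCH_HASH_SIZE];
static unsigned long sketch_cache_size;   /* plays the role of page_cache_size */

/* Placeholder hash: mixes the owner's address with the offset. */
static struct sketch_page **sketch_page_hash(const void *mapping, unsigned long index)
{
    uintptr_t key = (uintptr_t)mapping / sizeof(void *) + index;

    return &sketch_hash_table[key & (SKETCH_HASH_SIZE - 1)];
}

/* Insert at the head of the collision chain that starts at *slot. */
static void sketch_add_to_hash(struct sketch_page *page, struct sketch_page **slot)
{
    struct sketch_page *next = *slot;

    *slot = page;
    page->next_hash = next;
    page->pprev_hash = slot;
    if (next)
        next->pprev_hash = &page->next_hash;
    sketch_cache_size++;
}

/* Unlink without scanning the chain: pprev_hash already tells us
 * which pointer has to be redirected. */
static void sketch_remove_from_hash(struct sketch_page *page)
{
    struct sketch_page *next = page->next_hash;

    *page->pprev_hash = next;
    if (next)
        next->pprev_hash = page->pprev_hash;
    page->next_hash = NULL;
    page->pprev_hash = NULL;
    sketch_cache_size--;
}

/* Walk one collision chain looking for an exact (mapping, index) match. */
static struct sketch_page *sketch_lookup(const void *mapping, unsigned long index)
{
    struct sketch_page *p;

    for (p = *sketch_page_hash(mapping, index); p; p = p->next_hash)
        if (p->mapping == mapping && p->index == index)
            return p;
    return NULL;
}
```

Storing the address of the previous pointer, rather than the address of the previous descriptor, lets removal treat the table slot and an ordinary chain element identically.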

14.1.2.2 The lists of page descriptors in the address_space object

As we have seen, the address_space object includes three lists of page descriptors, whose heads are stored in the clean_pages, dirty_pages, and locked_pages fields. The lists allow the kernel to quickly find all pages of a file (or whatever) in a specific state:

clean_pages
    Includes the pages not locked and not dirty (the PG_locked and PG_dirty flags in the page descriptor are equal to 0). The PG_uptodate flag indicates whether the data in the pages is up to date. Typically, a page is not up to date when its contents have yet to be read from the corresponding image on disk.

dirty_pages
    Includes the pages that contain up-to-date data, but whose image on disk has yet to be updated. The PG_uptodate and PG_dirty flags in the page descriptor are set, while the PG_locked flag is clear.

locked_pages
    Includes the pages whose contents are being transferred to or from disk, so the pages cannot be currently accessed. The PG_locked flag is set.

The add_page_to_inode_queue( ) function is used to insert a page descriptor into the clean_pages list of an address_space object. Conversely, the remove_page_from_inode_queue( ) function is used to remove a page descriptor from the list that currently includes it. [1] The kernel moves a page descriptor from one list to another whenever the page changes its state.

[1] The names of these functions are inherited from the old Version 2.2 of the kernel.

14.1.2.3 Page descriptor fields related to the page cache

When a page is included in the page cache, some fields of the corresponding page descriptor have special meanings:

list
    Depending on the state of the page, includes pointers for the next and previous elements in the doubly linked list of clean, dirty, or locked pages of the address_space object.

mapping
    Points to the address_space object to which the page belongs. If the page does not belong to the page cache, this field is NULL.

index
    When the page's owner is an inode object, specifies the position of the data contained in the page within the disk image. The value is in page-size units.

next_hash
    Points to the next colliding page descriptor in the page hash list.

pprev_hash
    Points to the next_hash field of the previous colliding page descriptor in the page hash list.
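Because index is expressed in page-size units, a byte offset within the owner's disk image has to be split into a page index and an offset inside that page. A small sketch of the arithmetic, assuming 4 KB pages (a page shift of 12); the sk_ names are hypothetical helpers, not kernel functions:

```c
#include <assert.h>

#define SK_PAGE_SHIFT 12                    /* assumption: 4 KB pages */
#define SK_PAGE_SIZE  (1UL << SK_PAGE_SHIFT)

/* Which page of the disk image holds this byte offset? */
static unsigned long sk_offset_to_index(unsigned long long file_offset)
{
    return (unsigned long)(file_offset >> SK_PAGE_SHIFT);
}

/* Where inside that page does the byte live? */
static unsigned long sk_offset_in_page(unsigned long long file_offset)
{
    return (unsigned long)(file_offset & (SK_PAGE_SIZE - 1));
}
```

Under these assumptions, byte 10,000 of a file lies in the page with index 2, at offset 1,808 within it (10,000 = 2 × 4,096 + 1,808).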

In addition, when a page is inserted into the page cache, the usage counter (count field) of the corresponding page descriptor is incremented. If the count field is exactly 1, the page belongs to the cache but is not being accessed by any process; it can thus be removed from the page cache whenever free memory becomes scarce, as described in Chapter 16.

14.1.3 Page Cache Handling Functions

The high-level functions that use the page cache involve finding, adding, and removing a page.

The find_get_page macro receives as parameters the address of an address_space object and an offset value. It uses the page_hash macro to derive the address of the hash table entry corresponding to the values of the parameters, and invokes the __find_get_page( ) function to search for the requested page descriptor in the proper collision list. In turn, __find_get_page( ) acquires the pagecache_lock spin lock, scans the list of entries that have the same hash value, then releases the spin lock. If the page is found, the function increments the count field of the corresponding page descriptor and returns its address; otherwise, it returns NULL.

The add_to_page_cache( ) function inserts a new page descriptor (whose address is passed as a parameter) in the page cache. This is achieved by performing the following operations:

1. Acquires the pagecache_lock spin lock.

2. Clears the PG_uptodate, PG_error, PG_dirty, PG_referenced, PG_arch_1, and PG_checked flags, and sets the PG_locked flag of the page frame to indicate that the page is locked and present in the cache, but not yet filled with data.

3. Increments the count field of the page descriptor.

4. Initializes the index field of the page descriptor with a value passed as a parameter, which specifies the position of the data contained in the page within the page's disk image.

5. Invokes add_page_to_inode_queue( ) to insert the page descriptor in the clean_pages list of an address_space object, whose address is passed as a parameter.

6. Invokes add_page_to_hash_queue( ) to insert the page descriptor in the hash table, using the address_space object address and the value of the page's index field as hash keys.

7. Releases the pagecache_lock spin lock.

8. Invokes lru_cache_add( ) to add the page descriptor in the inactive list (see Chapter 16).
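The lookup and insertion paths described above can be modeled together in a few dozen lines. This is a simplified user-space sketch rather than the kernel's code: the sk_ names and flag bit positions are hypothetical, the clean_pages list is reduced to a singly linked list, the spin lock is replaced by a counter that only checks that acquire and release are paired, and the PG_arch_1 and PG_checked flags and the final lru_cache_add( ) step are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SK_HASH_SIZE 16u

/* Hypothetical bit positions, chosen only for this sketch. */
enum { SK_PG_locked = 1u << 0, SK_PG_uptodate = 1u << 1, SK_PG_dirty = 1u << 2,
       SK_PG_error = 1u << 3, SK_PG_referenced = 1u << 4 };

struct sk_space;

struct sk_page {
    struct sk_space *mapping;
    unsigned long index;
    unsigned flags;
    int count;
    struct sk_page *next_hash;    /* collision chain */
    struct sk_page *next_clean;   /* simplified clean_pages linkage */
};

struct sk_space {
    struct sk_page *clean_pages;
    unsigned long nrpages;
};

static struct sk_page *sk_hash[SK_HASH_SIZE];
static int sk_lock_depth;         /* stand-in for the pagecache_lock spin lock */

static void sk_lock(void)   { assert(sk_lock_depth == 0); sk_lock_depth = 1; }
static void sk_unlock(void) { assert(sk_lock_depth == 1); sk_lock_depth = 0; }

static struct sk_page **sk_page_hash(struct sk_space *m, unsigned long index)
{
    return &sk_hash[((uintptr_t)m / sizeof(void *) + index) % SK_HASH_SIZE];
}

/* Lookup: scan one collision chain under the lock; on a hit, take a
 * reference by incrementing count. Returns NULL on a miss. */
static struct sk_page *sk_find_get_page(struct sk_space *m, unsigned long index)
{
    struct sk_page *p;

    sk_lock();
    for (p = *sk_page_hash(m, index); p; p = p->next_hash)
        if (p->mapping == m && p->index == index) {
            p->count++;
            break;
        }
    sk_unlock();
    return p;
}

static void sk_add_to_page_cache(struct sk_page *page, struct sk_space *m,
                                 unsigned long index)
{
    struct sk_page **slot;

    sk_lock();                                   /* step 1 */
    page->flags &= ~(SK_PG_uptodate | SK_PG_error |
                     SK_PG_dirty | SK_PG_referenced);
    page->flags |= SK_PG_locked;                 /* step 2: locked, not yet filled */
    page->count++;                               /* step 3 */
    page->index = index;                         /* step 4 */
    page->mapping = m;                           /* step 5: owner's clean list */
    page->next_clean = m->clean_pages;
    m->clean_pages = page;
    m->nrpages++;
    slot = sk_page_hash(m, index);               /* step 6: hash table */
    page->next_hash = *slot;
    *slot = page;
    sk_unlock();                                 /* step 7 */
    /* step 8 (adding the page to the inactive list) is omitted here */
}
```

Note that a freshly allocated page already has a count of 1, so after insertion the counter is 2: one reference for the cache and one for the caller that is about to fill the page.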

The find_or_create_page( ) function is similar to find_get_page; however, if the requested page is not in the cache, the function invokes alloc_page( ) to get a new page frame, then invokes add_to_page_cache( ) to insert the page descriptor in the page cache.

The remove_inode_page( ) function removes a page descriptor from the page cache. This is achieved by acquiring the pagecache_lock spin lock, invoking remove_page_from_inode_queue( ) and remove_page_from_hash_queue( ), and then releasing the spin lock.


14.2 The Buffer Cache

The whole idea behind the buffer cache is to relieve processes from having to wait for relatively slow disks to retrieve or store data. Thus, it would be counterproductive to write a lot of data at once; instead, data should be written piecemeal at regular intervals so that I/O operations have a minimal impact on the speed of the user processes and on response time experienced by human users.

The kernel maintains a lot of information about each buffer to help it pace the writes, including a "dirty" bit to indicate the buffer has been changed in memory and needs to be written, and a timestamp to indicate how long the buffer should be kept in memory before being flushed to disk. Information on buffers is kept in buffer heads (introduced in the previous chapter), so these data structures require maintenance along with the buffers of user data themselves.

The size of the buffer cache may vary. Page frames are allocated on demand when a new buffer is required and one is not available. When free memory becomes scarce, as we shall see in Chapter 16, buffers are released and the corresponding page frames are recycled.

The buffer cache consists of two kinds of data structures:

● A set of buffer heads describing the buffers in the cache

● A hash table to help the kernel quickly derive the buffer head that describes the buffer associated with a given pair of device and block numbers

14.2.1 Buffer Head Data Structures

As mentioned in Section 13.4.4, each buffer head is stored in a data structure of type buffer_head. These data structures have their own slab allocator cache called bh_cachep, which should not be confused with the buffer cache itself. The slab allocator cache is a memory cache (see Section 3.2.2) for the buffer head objects, meaning that it has no interaction with disks and is simply a way of managing memory efficiently. In contrast, the buffer cache is a disk cache for the data in the buffers.

Each buffer used by a block device driver must have a corresponding buffer head that describes the buffer's current status. The converse is not true: a buffer head may be unused, which means it is not bound to any buffer. The kernel keeps a certain number of unused buffer heads to avoid the overhead of constantly allocating and deallocating memory.

In general, a buffer head may be in any one of the following states:

Unused buffer head
    The object is available; the values of its fields are meaningless, except for the b_dev field that stores the value B_FREE (0xffff).

Buffer head for a cached buffer
    Its b_data field points to a buffer stored in the buffer cache. The b_dev field identifies a block device, and the BH_Mapped flag is set. Moreover, the buffer could be one of the following:

    Not up-to-date (BH_Uptodate flag is clear)
        The data in the buffer is not valid (for instance, the data has yet to be read from disk).

    Dirty (BH_Dirty flag set)
        The data in the buffer has been modified, and the corresponding block on disk needs to be updated.

    Locked (BH_Lock flag set)
        An I/O data transfer on the buffer contents is in progress.

Asynchronous buffer head
    Its b_data field points to a buffer inside a page that is involved in a page I/O operation (see Section 13.4.8.2); in this case, the BH_Async flag is set. As soon as the page I/O operation terminates, the BH_Async flag is cleared, but the buffer head is not freed; rather, it remains allocated and inserted into a singly linked circular list of the page (see the later section Section 14.2.2). Thus, it can be reused without the overhead of always allocating new ones. Asynchronous buffer heads are released whenever the kernel tries to reclaim some memory (see Chapter 16).

Strictly speaking, the buffer cache data structures include only pointers to buffer heads for a cached buffer. For the sake of completeness, we shall examine the data structures and the methods used by the kernel to handle all kinds of buffer heads, not just those in the buffer cache.

14.2.1.1 The list of unused buffer heads

All unused buffer heads are collected in a singly linked list, whose first element is addressed by the unused_list variable. Each buffer head stores the address of the next list element in the b_next_free field. The current number of elements in the list is stored in the nr_unused_buffer_heads variable. The unused_list_lock spin lock protects the list against concurrent accesses in multiprocessor systems.

The list of unused buffer heads acts as a primary memory cache for the buffer head objects, while the bh_cachep slab allocator cache is a secondary memory cache. When a buffer head is no longer needed, it is inserted into the list of unused buffer heads. Buffer heads are released to the slab allocator (a preliminary step to letting the kernel free the memory associated with them altogether) only when the number of list elements exceeds MAX_UNUSED_BUFFERS (usually 100 elements). In other words, a buffer head in this list is considered an allocated object by the slab allocator and an unused data structure by the buffer cache.

A subset of NR_RESERVED (usually 80) elements in the list is reserved for page I/O operations. This is done to prevent nasty deadlocks caused by the lack of free buffer heads. As we shall see in Chapter 16, if free memory is scarce, the kernel can try to free a page frame by swapping out some page to disk. To do this, it requires at least one additional buffer head to perform the page I/O file operation. If the swapping algorithm fails to get a buffer head, it simply keeps waiting and lets writes to files proceed to free up buffers, since at least NR_RESERVED buffer heads are going to be released as soon as the ongoing file operations terminate.

The get_unused_buffer_head( ) function is invoked to get a new buffer head. It essentially performs the following operations:

1. Acquires the unused_list_lock spin lock.

2. If the list of unused buffer heads has more than NR_RESERVED elements, removes one of them from the list, releases the spin lock, and returns the address of the buffer head.

3. Otherwise, releases the spin lock and invokes kmem_cache_alloc( ) to allocate a new buffer head from the bh_cachep slab allocator cache with priority GFP_NOFS (see Section 7.1.5); if the operation succeeds, returns its address.

4. No free memory is available. If the buffer head has been requested for a block I/O operation, returns NULL (failure).

5. If this point is reached, the buffer head has been requested for a page I/O operation. If the list of unused buffer heads is not empty, acquires the unused_list_lock spin lock, removes one element from the list, releases the spin lock, and returns the address of the buffer head.

6. Otherwise (if the list is empty), returns NULL (failure).
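The allocation steps above, together with the release counterpart put_unused_buffer_head( ) that the text describes immediately afterwards, amount to a two-level cache with a reserve. The following user-space sketch models that policy under several assumptions: the sk_ names are hypothetical, malloc( ) and free( ) stand in for the slab allocator, a test knob simulates memory exhaustion, the thresholds are shrunk so the behavior is easy to follow, and the spin lock is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SK_NR_RESERVED        2   /* the text cites 80 as the usual value */
#define SK_MAX_UNUSED_BUFFERS 4   /* the text cites 100 as the usual value */

struct sk_bh {
    struct sk_bh *b_next_free;    /* next element of the unused list */
};

static struct sk_bh *sk_unused_list;   /* head of the unused list */
static int sk_nr_unused;               /* current length of the list */
static int sk_slab_exhausted;          /* knob: make the stand-in allocator fail */

/* Stand-in for allocating a buffer head from the slab allocator cache. */
static struct sk_bh *sk_slab_alloc(void)
{
    if (sk_slab_exhausted)
        return NULL;
    return malloc(sizeof(struct sk_bh));
}

/* Detach the first element of the unused list (the list must not be empty). */
static struct sk_bh *sk_pop_unused(void)
{
    struct sk_bh *bh = sk_unused_list;

    sk_unused_list = bh->b_next_free;
    sk_nr_unused--;
    bh->b_next_free = NULL;
    return bh;
}

static struct sk_bh *sk_get_unused_buffer_head(int for_page_io)
{
    struct sk_bh *bh;

    /* Steps 1-2: use the list only while it holds more than the reserve. */
    if (sk_nr_unused > SK_NR_RESERVED)
        return sk_pop_unused();

    /* Step 3: fall back to the secondary cache. */
    bh = sk_slab_alloc();
    if (bh)
        return bh;

    /* Step 4: block I/O requests may not touch the reserve. */
    if (!for_page_io)
        return NULL;

    /* Steps 5-6: page I/O requests may dip into the reserved elements. */
    if (sk_unused_list)
        return sk_pop_unused();
    return NULL;
}

static void sk_put_unused_buffer_head(struct sk_bh *bh)
{
    if (sk_nr_unused >= SK_MAX_UNUSED_BUFFERS) {
        free(bh);                 /* list is full: give the memory back */
        return;
    }
    bh->b_next_free = sk_unused_list;
    sk_unused_list = bh;
    sk_nr_unused++;
}
```

With the list holding exactly SK_NR_RESERVED elements and the allocator failing, a block I/O request gets NULL while a page I/O request still succeeds, which is the behavior the reserve is meant to guarantee.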

The put_unused_buffer_head( ) function performs the reverse operation, releasing a buffer head. It inserts the object in the list of unused buffer heads if that list has fewer than MAX_UNUSED_BUFFERS elements; otherwise, it releases the object to the slab allocator by invoking kmem_cache_free( ) on the buffer head.

14.2.1.2 Lists of buffer heads for cached buffers

When a buffer belongs to the buffer cache, the flags of the corresponding buffer head describe its current status (see Section 13.4.4). For instance, when a block not present in the cache must be read from disk, a new buffer is allocated and the BH_Uptodate flag of the buffer head is cleared because the buffer's contents are meaningless. While filling the buffer by reading from disk, the BH_Lock flag is set to protect the buffer from being reclaimed. If the read operation terminates successfully, the BH_Uptodate flag is set and the BH_Lock flag is cleared. If the block must be written to disk, the buffer content is modified and the BH_Dirty flag is set; the flag is cleared only after the buffer is successfully written to disk.

Any buffer head associated with a used buffer is contained in a doubly linked list, implemented by means of the b_next_free and b_prev_free fields. There are three different lists, identified by an index defined as a macro (BUF_CLEAN, BUF_LOCKED, and BUF_DIRTY). We'll define these lists in a moment.

The three lists are introduced to speed up the functions that flush dirty buffers to disk (see Section 14.2.4 later in this chapter). For reasons of efficiency, a buffer head is not moved right away from one list to another when it changes status; this makes the following description a bit murky.

BUF_CLEAN
    This list collects buffer heads of nondirty buffers (BH_Dirty flag is off). Notice that buffers in this list are not necessarily up to date; that is, they don't necessarily contain valid data. If the buffer is not up to date, it could even be locked (BH_Lock is on) and selected to be read from the physical device while being on this list. The buffer heads in this list are guaranteed only to be not dirty; in other words, the corresponding buffers are ignored by the functions that flush dirty buffers to disk.

BUF_DIRTY
    This list mainly collects buffer heads of dirty buffers that have not been selected to be written into the physical device; that is, dirty buffers that have not yet been included in a block request for a block device driver (BH_Dirty is on and BH_Lock is off). However, this list could also include nondirty buffers, since in a few cases, the BH_Dirty flag of a dirty buffer is cleared without flushing it to disk and without removing the buffer head from the list (for instance, whenever a floppy disk is removed from its drive without unmounting, an event that most likely leads to data loss, of course).

BUF_LOCKED Th is lis t m a in ly co lle ct s b u ffe r h e a d s o f b u ffe rs t h a t h a ve b e e n s e le ct e d t o b e re a d fro m o r writ t e n t o t h e b lo ck d e vice ( BH_Lock is o n ; BH_Dirty is cle a r b e ca u s e t h e add_request(

) fu n ct io n re s e t s it b e fo re in clu d in g t h e b u ffe r h e a d in a b lo ck re q u e s t ) . Ho we ve r, wh e n a b lo ck I/ O o p e ra t io n fo r a lo cke d b u ffe r is co m p le t e d , t h e lo w- le ve l b lo ck d e vice h a n d le r cle a rs t h e BH_Lock fla g wit h o u t re m o vin g t h e b u ffe r h e a d fro m t h e lis t ( s e e S e ct io n 1 3 . 4 . 7 ) . Th e b u ffe r h e a d s in t h is lis t a re g u a ra n t e e d o n ly t o b e n o t d irt y, o r d irt y b u t s e le ct e d t o b e writ t e n . Fo r a n y b u ffe r h e a d a s s o cia t e d wit h a u s e d b u ffe r, t h e b_list fie ld o f t h e b u ffe r h e a d s t o re s t h e in d e x o f t h e lis t co n t a in in g t h e b u ffe r. Th e lru_list a rra y [ 2 ] s t o re s t h e a d d re s s o f t h e firs t e le m e n t in e a ch lis t , t h e nr_buffers_type a rra y s t o re s t h e n u m b e r o f e le m e n t s in e a ch lis t , a n d t h e size_buffers_type a rra y s t o re s t h e t o t a l ca p a cit y o f t h e b u ffe rs in e a ch lis t ( in b yt e ) . Th e

lru_list_lock s p in lo ck p ro t e ct s t h e s e a rra ys fro m co n cu rre n t a cce s s e s in m u lt ip ro ce s s o r s ys t e m s . [2]

The name of the array derives from the abbreviation for Least Recently Used. In earlier versions of Linux, these lists were ordered according to the time each buffer was last accessed.

The mark_buffer_dirty( ) and mark_buffer_clean( ) functions set and clear, respectively, the BH_Dirty flag of a buffer head. To keep the number of dirty buffers in the system bounded, mark_buffer_dirty( ) invokes the balance_dirty( ) function (see Section 14.2.4 later in this chapter). Both functions also invoke the refile_buffer( ) function, which moves the buffer head into the proper list according to the value of the BH_Dirty and BH_Lock flags.

Besides the BUF_DIRTY list, the kernel manages two doubly linked lists of dirty buffers for every inode object. They are used whenever the kernel must flush all dirty buffers of a given file, for instance, when servicing the fsync( ) or fdatasync( ) system calls (see Section 14.2.4.3 later in this chapter).

The first of the two lists includes buffers containing the file's control data (like the disk inode itself), while the other list includes buffers containing the file's data. The heads of these lists are stored in the i_dirty_buffers and i_dirty_data_buffers fields of the inode object, respectively. The b_inode_buffers field of any buffer head stores the pointers to the next and previous elements of these lists. Both of them are protected by the lru_list_lock spin lock just mentioned. The buffer_insert_inode_queue( ) and buffer_insert_inode_data_queue( ) functions are used, respectively, to insert a buffer head in the i_dirty_buffers and i_dirty_data_buffers lists. The inode_remove_queue( ) function removes a buffer head from the list that includes it.

14.2.1.3 The hash table of cached buffer heads

The addresses of the buffer heads belonging to the buffer cache are inserted into a hash table.

Given a device identifier and a block number, the kernel can use the hash table to quickly derive the address of the corresponding buffer head, if one exists. The hash table noticeably improves kernel performance because checks on buffer heads are frequent. Before starting a block I/O operation, the kernel must check whether the required block is already in the buffer cache; in this situation, the hash table lets the kernel avoid a lengthy sequential scan of the lists of cached buffers.

The hash table is stored in the hash_table array, which is allocated during system initialization and whose size depends on the amount of RAM installed on the system. For example, for systems having 128 MB of RAM, hash_table is stored in 4 page frames and includes 4,096 buffer head pointers. As usual, entries that cause a collision are chained in doubly linked lists implemented by means of the b_next and b_pprev fields of each buffer head. The hash_table_lock read/write spin lock protects the hash table data structures from concurrent accesses in multiprocessor systems.

The get_hash_table( ) function retrieves a buffer head from the hash table. The buffer head to be located is identified by three parameters: the device number, the block number, and the size of the corresponding data block. The function hashes the values of the device number and the block number, and looks into the hash table to find the first element in the collision list; then it checks the b_dev, b_blocknr, and b_size fields of each element in the list and returns the address of the requested buffer head. If the buffer head is not in the cache, the function returns NULL.
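The lookup just described can be modeled in user space. The following is only a sketch under stated assumptions: the buffer head is reduced to the fields named in the text, the table size of 4,096 entries is taken from the 128 MB example, locking is omitted, and hashfn( ), hash_insert( ), and hash_lookup( ) are hypothetical names; in particular, the kernel's real hash function is different.

```c
#include <assert.h>
#include <stddef.h>

#define NR_HASH 4096u             /* assumed table size (power of two) */

/* Simplified stand-in for the kernel's buffer head: only the fields
 * the lookup needs. */
struct buffer_head {
    struct buffer_head *b_next;   /* next element in the collision chain */
    struct buffer_head **b_pprev; /* address of the pointer that points to us */
    unsigned int  b_dev;          /* device identifier */
    unsigned long b_blocknr;      /* block number */
    unsigned int  b_size;         /* block size in bytes */
};

static struct buffer_head *hash_table[NR_HASH];

/* Hypothetical hash of device and block number (not the kernel's). */
static unsigned int hashfn(unsigned int dev, unsigned long block)
{
    return (unsigned int)((dev ^ block) & (NR_HASH - 1));
}

/* Insert a buffer head at the front of its collision chain. */
static void hash_insert(struct buffer_head *bh)
{
    struct buffer_head **head = &hash_table[hashfn(bh->b_dev, bh->b_blocknr)];

    assert(bh->b_size > 0);
    bh->b_next = *head;
    if (*head)
        (*head)->b_pprev = &bh->b_next;
    bh->b_pprev = head;
    *head = bh;
}

/* Walk the chain and compare device, block number, and size;
 * return NULL when no element matches. */
static struct buffer_head *hash_lookup(unsigned int dev, unsigned long block,
                                       unsigned int size)
{
    struct buffer_head *bh;

    for (bh = hash_table[hashfn(dev, block)]; bh != NULL; bh = bh->b_next)
        if (bh->b_dev == dev && bh->b_blocknr == block && bh->b_size == size)
            return bh;
    return NULL;
}
```

Two blocks whose numbers differ by a multiple of the table size land in the same chain here, which is why the chain walk must compare all three fields instead of trusting the hash value.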

14.2.1.4 Buffer usage counter

The b_count field of the buffer head is a usage counter for the corresponding buffer. The counter is incremented right before each operation on the buffer and decremented right after. It acts mainly as a safety lock, since the kernel never destroys a buffer (or its contents) as long as it has a non-null usage counter. Instead, the cached buffers are examined either periodically or when the free memory becomes scarce, and only those buffers that have null counters may be destroyed (see Chapter 16). In other words, a buffer with a null usage counter may belong to the buffer cache, but it cannot be determined how long the buffer will stay in the cache.

When a kernel control path wishes to access a buffer, it should increment the usage counter first. This task is performed by the getblk( ) function, which is usually invoked to locate the buffer, so that the increment need not be done explicitly by higher-level functions. When a kernel control path stops accessing a buffer, it may invoke either brelse( ) or bforget( ) to decrement the corresponding usage counter. The difference between these two functions is that bforget( ) also marks the buffer as clean, thus forcing the kernel to forget any change in the buffer that has yet to be written on disk.
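The counting discipline can be illustrated with a small user-space model. This is a sketch, not kernel code: the structure keeps only a counter and a flag word, BH_DIRTY_BIT stands in for the BH_Dirty flag, and get_buffer( ), release_buffer( ), forget_buffer( ), and can_reclaim( ) are hypothetical names that mimic what the text attributes to getblk( ), brelse( ), and bforget( ).

```c
#include <assert.h>

#define BH_DIRTY_BIT 0x1u   /* stand-in for the BH_Dirty flag */

struct buffer_head {
    int b_count;            /* usage counter */
    unsigned int b_state;   /* flag bits */
};

/* Take a reference before touching the buffer. */
static void get_buffer(struct buffer_head *bh)
{
    bh->b_count++;
}

/* Drop a reference; the buffer keeps its dirty state. */
static void release_buffer(struct buffer_head *bh)
{
    assert(bh->b_count > 0);
    bh->b_count--;
}

/* Drop a reference and discard pending changes by clearing the dirty bit. */
static void forget_buffer(struct buffer_head *bh)
{
    assert(bh->b_count > 0);
    bh->b_state &= ~BH_DIRTY_BIT;
    bh->b_count--;
}

/* A buffer may be reclaimed only while nobody holds a reference to it. */
static int can_reclaim(const struct buffer_head *bh)
{
    return bh->b_count == 0;
}
```

A reclaim routine built on this model would skip every buffer for which can_reclaim( ) is false, which is the "safety lock" role the text assigns to the counter.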

14.2.2 Buffer Pages

Although the page cache and the buffer cache are different disk caches, in Version 2.4 of Linux, they are somewhat intertwined.

In fact, for reasons of efficiency, buffers are not allocated as single memory objects; instead, buffers are stored in dedicated pages called buffer pages. All the buffers within a single buffer page must have the same size; hence, on the 80x86 architecture, a buffer page can include from one to eight buffers, depending on the block size.

A stronger constraint, however, is that all the buffers in a buffer page must refer to adjacent blocks of the underlying block device. For instance, suppose that the kernel wants to read a 1,024-byte inode block of a regular file. Instead of allocating a single 1,024-byte buffer for the inode, the kernel must reserve a whole page storing four buffers; these buffers will contain the data of a group of four adjacent blocks on the block device, including the requested inode block.

It is easy to understand that a buffer page can be regarded in two different ways. On one hand, it is the "container" for some buffers, which can be individually addressed by means of the buffer cache. On the other hand, each buffer page contains a 4,096-byte portion of a block device file; hence, it can be included in the page cache. In other words, the portion of RAM cached by the buffer cache is always a subset of the portion of RAM cached by the page cache. The benefit of this mechanism consists of dramatically reducing the synchronization problems between the buffer cache and the page cache.

In the 2.2 version of the kernel, the two disk caches were not intertwined. A given physical block could have two images in RAM: one in the page cache and the other in the buffer cache. To avoid data loss, whenever one of the block's two memory images was modified, the 2.2 kernel also had to find and update the other memory image. As you might imagine, this is a costly operation.

By way of contrast, in Linux 2.4, modifying a buffer implies modifying the page that contains it, and vice versa. The kernel must only pay attention to the "dirty" flags of both the buffer heads and the page descriptors. For instance, whenever a buffer head is marked as "dirty," the kernel must also set the PG_dirty flag of the page that contains the corresponding buffer.

Buffer heads and page descriptors include a few fields that define the link between a buffer page and the corresponding buffers. If a page acts as a buffer page, the buffers field of its page descriptor points to the buffer head of the first buffer included in the page; otherwise the buffers field is NULL. In turn, the b_this_page field of each buffer head implements a singly linked circular list that includes all buffer heads of the buffers stored in the buffer page. Moreover, the b_page field of each buffer head points to the page descriptor of the corresponding buffer page. Figure 14-1 shows a buffer page containing four buffers and the corresponding buffer heads.

Figure 14-1. A buffer page including four buffers and their buffer heads

There is a special case: if a page has been involved in a page I/O operation (see Section 13.4.8.2), the kernel might have allocated some asynchronous buffer heads and linked them to the page by means of the buffers and b_this_page fields. Thus, a page could act as a buffer page under some circumstances, even though the corresponding buffer heads are not in the buffer cache.
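The links between a page descriptor and its buffer heads can be sketched as follows. This is a simplified user-space model: the two structures keep only the fields mentioned above, the dirty members stand in for the BH_Dirty and PG_dirty flags, and attach_buffers( ), count_buffers( ), and mark_dirty( ) are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>

struct page;

struct buffer_head {
    struct buffer_head *b_this_page; /* next buffer head in the same page (circular) */
    struct page *b_page;             /* descriptor of the containing page */
    int dirty;                       /* stand-in for BH_Dirty */
};

struct page {
    struct buffer_head *buffers;     /* first buffer head, or NULL */
    int dirty;                       /* stand-in for PG_dirty */
};

/* Link n buffer heads into a circular list and attach them to the page. */
static void attach_buffers(struct page *pg, struct buffer_head *bh, int n)
{
    int i;

    assert(n > 0);
    for (i = 0; i < n; i++) {
        bh[i].b_this_page = &bh[(i + 1) % n];
        bh[i].b_page = pg;
        bh[i].dirty = 0;
    }
    pg->buffers = &bh[0];
}

/* Count the buffers of a page by walking the circular list once. */
static int count_buffers(const struct page *pg)
{
    const struct buffer_head *head = pg->buffers, *bh = head;
    int n = 0;

    if (head == NULL)
        return 0;                    /* not a buffer page */
    do {
        n++;
        bh = bh->b_this_page;
    } while (bh != head);
    return n;
}

/* Marking a buffer dirty also marks the containing page dirty. */
static void mark_dirty(struct buffer_head *bh)
{
    bh->dirty = 1;
    bh->b_page->dirty = 1;
}
```

Because the list is circular, the walk stops when it returns to the head taken from the buffers field; no NULL terminator exists inside the list.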

14.2.2.1 Allocating buffer pages

The kernel allocates a new buffer page when it discovers that the buffer cache does not include data for a given block. In this case, the kernel invokes the grow_buffers( ) function, passing to it three parameters that identify the block:

● The block device number: the major and minor numbers of the device
● The logical block number: the position of the block inside the block device
● The block size

The function essentially performs the following actions:

1. Computes the offset index of the page of data within the block device that includes the requested block.

2. Gets the address bdev of the block device descriptor (see Section 13.4.1).

3. Invokes grow_dev_page( ) to create a new buffer page, if necessary. In turn, this function performs the following substeps:

   a. Invokes find_or_create_page( ), passing to it the address_space object of the block device (bdev->bd_inode->i_mapping) and the page offset index. As described in Section 14.1.3 earlier in this chapter, find_or_create_page( ) looks for the page in the page cache and, if necessary, inserts a new page in the cache.

   b. Now the page cache is known to include a descriptor for our page. The function checks its buffers field; if it is NULL, the page has not yet been filled with buffers and the function jumps to Step 3e.

   c. Checks whether the size of the buffers on the page is equal to the size of the requested block; if so, returns the address of the page descriptor (the page found in the page cache is a valid buffer page).

   d. Otherwise, checks whether the buffers found in the page can be released by invoking try_to_free_buffers( ). [3] If the function fails, presumably because some process is using the buffers, grow_dev_page( ) returns NULL (it was not able to allocate the buffer page for the requested block).

   [3] This can happen when the page was previously involved in a page I/O operation using a different block size, and the corresponding asynchronous buffer heads are still allocated.

   e. Invokes the create_buffers( ) function to allocate the buffer heads for the blocks of the requested size within the page. The address of the buffer head for the first buffer in the page is stored in the buffers field of the page descriptor, and all buffer heads are inserted into the singly linked circular list implemented by the b_this_page fields of the buffer heads. Moreover, the b_page fields of the buffer heads are initialized with the address of the page descriptor.

   f. Returns the page descriptor address.

4. If grow_dev_page( ) returned NULL, returns 0 (failure).

5. Invokes the hash_page_buffers( ) function to initialize the fields of all buffer heads in the singly linked circular list of the buffer page and insert them into the buffer cache.

6. Unlocks the page (the page was locked by find_or_create_page( )).

7. Decrements the page's usage counter (again, the counter was incremented by find_or_create_page( )).

8. Increments the buffermem_pages variable, which stores the total number of buffer pages, that is, the memory currently cached by the buffer cache in page-size units.

9. Returns 1 (success).
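Step 1 above reduces to simple arithmetic, because every buffer page holds a run of adjacent, equally sized blocks. The following sketch assumes a 4,096-byte page and power-of-two block sizes from 512 to 4,096 bytes (the "one to eight buffers" range mentioned earlier); buffers_per_page( ), page_index( ), and first_block( ) are hypothetical names, not kernel functions.

```c
#include <assert.h>

#define PAGE_SIZE 4096UL    /* 80x86 page size */

/* Number of buffers of a given size that fit in one page. */
static unsigned long buffers_per_page(unsigned long size)
{
    /* size must be a power of two between 512 bytes and one page */
    assert(size >= 512 && size <= PAGE_SIZE && (size & (size - 1)) == 0);
    return PAGE_SIZE / size;
}

/* Offset, in pages, of the page of device data that contains the block. */
static unsigned long page_index(unsigned long block, unsigned long size)
{
    return block / buffers_per_page(size);
}

/* Number of the first block stored in that page. */
static unsigned long first_block(unsigned long block, unsigned long size)
{
    return page_index(block, size) * buffers_per_page(size);
}
```

For example, with 1,024-byte blocks, block 10 lives in the page at index 2, which also holds blocks 8, 9, and 11.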

14.2.3 The getblk( ) Function

The getblk( ) function is the main service routine for the buffer cache. When the kernel needs to read or write the contents of a block of a physical device, it must check whether the buffer head for the required buffer is already included in the buffer cache. If the buffer is not there, the kernel must create a new entry in the cache. To do this, the kernel invokes getblk( ), specifying as parameters the device identifier, the block number, and the block size. This function returns the address of the buffer head associated with the buffer.

Remember that having a buffer head in the cache does not imply that the data in the buffer is valid. (For instance, the buffer has yet to be read from disk.) Any function that reads blocks must check whether the buffer obtained from getblk( ) is up to date; if not, it must read the block first from disk before using the buffer.

The getblk( ) function looks deceptively simple:

    struct buffer_head * getblk(kdev_t dev, int block, int size)
    {
        for (;;) {
            struct buffer_head * bh;
            bh = get_hash_table(dev, block, size);
            if (bh)
                return bh;
            if (!grow_buffers(dev, block, size))
                free_more_memory( );
        }
    }

The function first invokes get_hash_table( ) (see Section 14.2.1.3 earlier in this chapter) to check whether the required buffer head is already in the cache. If so, getblk( ) returns the buffer head address.

Otherwise, if the required buffer head is not in the cache, getblk( ) invokes grow_buffers( ) to allocate a new buffer page that contains the buffer for the requested block. If grow_buffers( ) fails in allocating such a page, getblk( ) tries to reclaim some memory (see Chapter 16). These actions are repeated until get_hash_table( ) succeeds in finding the requested buffer in the buffer cache.
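The rule that a caller must check whether the buffer returned by getblk( ) is up to date can be sketched in a few lines. This is a user-space illustration only: the uptodate member stands in for the BH_Uptodate flag, the reads member exists just to make the behavior observable, and read_from_disk( ) and read_block( ) are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

struct buffer_head {
    int uptodate;   /* stand-in for BH_Uptodate */
    int reads;      /* how many times the device was read (for illustration) */
};

/* Hypothetical device read: fills the buffer and marks it valid. */
static void read_from_disk(struct buffer_head *bh)
{
    bh->reads++;
    bh->uptodate = 1;
}

/* Return a buffer whose contents are valid, reading only if needed.
 * `cached` plays the role of the buffer head returned by getblk( ). */
static struct buffer_head *read_block(struct buffer_head *cached)
{
    assert(cached != NULL);
    if (!cached->uptodate)
        read_from_disk(cached);
    return cached;
}
```

A second call on the same buffer head finds it already valid and does not touch the device again, which is the whole point of caching the block.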

14.2.4 Writing Dirty Buffers to Disk

Unix systems allow the deferred writes of dirty buffers into block devices, since this noticeably improves system performance. Several write operations on a buffer could be satisfied by just one slow physical update of the corresponding disk block. Moreover, write operations are less critical than read operations, since a process is usually not suspended because of delayed writes, while it is most often suspended because of delayed reads. Thanks to deferred writes, each physical block device will service, on the average, many more read requests than write ones.

A dirty buffer might stay in main memory until the last possible moment, that is, until system shutdown. However, pushing the delayed-write strategy to its limits has two major drawbacks:

● If a hardware or power supply failure occurs, the contents of RAM can no longer be retrieved, so many file updates that were made since the system was booted are lost.
● The size of the buffer cache, and hence of the RAM required to contain it, would have to be huge: at least as big as the size of the accessed block devices.

Therefore, dirty buffers are flushed (written) to disk under the following conditions:

● The buffer cache gets too full and more buffers are needed, or the number of dirty buffers becomes too large; when one of these conditions occurs, the bdflush kernel thread is activated.
● Too much time has elapsed since a buffer has stayed dirty; the kupdate kernel thread regularly flushes old buffers.
● A process requests all the buffers of block devices or of particular files to be flushed; it does this by invoking the sync( ), fsync( ), or fdatasync( ) system call.

As explained in Section 14.2.2 earlier in this chapter, a buffer page is dirty (PG_dirty flag set) if some of its buffers are dirty. As soon as the kernel flushes all dirty buffers in a buffer page to disk, it resets the PG_dirty flag of the page.

14.2.4.1 The bdflush kernel thread

The bdflush kernel thread (also called kflushd) is created during system initialization. It executes the bdflush( ) function, which selects some dirty buffers and forces an update of the corresponding blocks on the physical block devices.

Some system parameters control the behavior of bdflush; they are stored in the b_un field of the bdf_prm table and are accessible either by means of the /proc/sys/vm/bdflush file or by invoking the bdflush( ) system call. Each parameter has a default standard value, although it may vary within a minimum and a maximum value stored in the bdflush_min and bdflush_max tables, respectively. The parameters are listed in Table 14-4. [4]

[4] The bdf_prm table also includes several other unused fields.

Table 14-4. Buffer cache tuning parameters

Parameter    Default  Min  Max        Description
nfract       40       0    100        Threshold percentage of dirty buffers for waking up bdflush
nfract_sync  60       0    100        Threshold percentage of dirty buffers for waking up bdflush in blocking mode
age_buffer   3000     100  600,000    Time-out in ticks of a dirty buffer for being written to disk
interval     500      0    1,000,000  Delay in ticks between kupdate activations

The most typical cases that cause the kernel thread to be woken up are:

● The balance_dirty( ) function verifies that the number of buffer pages in the BUF_DIRTY and BUF_LOCKED lists exceeds the threshold:

      P × bdf_prm.b_un.nfract_sync / 100

  where P represents the number of pages in the system that can be used as buffer pages (essentially, this is all the pages in the "DMA" and "Normal" memory zones; see Section 7.1.2). Actually, the computation is done by the balance_dirty_state( ) helper function, which returns -1 if the number of dirty or locked buffers is below the nfract threshold, 0 if it is between nfract and nfract_sync, and 1 if it is above nfract_sync. The balance_dirty( ) function is usually invoked whenever a buffer is marked as "dirty" and the function moves its buffer head into the BUF_DIRTY list.

● When the try_to_free_buffers( ) function fails to release the buffer heads of some buffer page (see Section 14.2.2.1 earlier in this chapter).

● When the grow_buffers( ) function fails to allocate a new buffer page, or the create_buffers( ) function fails to allocate a new buffer head (see Section 14.2.2.1 earlier in this chapter).

● When a user presses some specific combinations of keys on the console (usually ALT+SysRq+U and ALT+SysRq+S). These key combinations, which are enabled only if the Linux kernel has been compiled with the Magic SysRq Key option, allow Linux hackers to have some explicit control over kernel behavior.
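The three-way result of the threshold test in the first case of the list above can be sketched as a small function. This is a model under stated assumptions, not the kernel's balance_dirty_state( ): the struct bdflush_params type and the dirty_state( ) name are hypothetical, the page counts are passed in as plain arguments, and the treatment of values that fall exactly on a threshold is a guess, since the text does not specify it.

```c
#include <assert.h>

/* Tunables named after Table 14-4; the struct itself is a stand-in. */
struct bdflush_params {
    int nfract;        /* soft threshold, percent */
    int nfract_sync;   /* hard threshold, percent */
};

/*
 * dirty_pages:  pages held by buffers in the BUF_DIRTY and BUF_LOCKED lists
 * usable_pages: P, the pages that can be used as buffer pages
 * Returns -1 below the nfract threshold, 0 between the two thresholds,
 * and 1 above the nfract_sync threshold.
 */
static int dirty_state(unsigned long dirty_pages, unsigned long usable_pages,
                       const struct bdflush_params *p)
{
    unsigned long soft = usable_pages * (unsigned long)p->nfract / 100;
    unsigned long hard = usable_pages * (unsigned long)p->nfract_sync / 100;

    assert(p->nfract >= 0 && p->nfract <= p->nfract_sync);
    if (dirty_pages > hard)
        return 1;       /* far too many dirty buffers */
    if (dirty_pages > soft)
        return 0;       /* enough to justify waking up bdflush */
    return -1;          /* nothing to do (boundary handling is assumed) */
}
```

With the default values of 40 and 60 and P equal to 1,000 pages, 100 dirty pages yield -1, 500 yield 0, and 700 yield 1.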

To wake up bdflush, the kernel invokes the wakeup_bdflush( ) function, which simply executes:

    wake_up_interruptible(&bdflush_wait);

to wake up the process suspended in the bdflush_wait wait queue. There is just one process in this wait queue, namely bdflush itself.

The core of the bdflush( ) function is the following endless loop:

    for (;;) {
        if (emergency_sync_scheduled)   /* Only if the kernel has been compiled */
            do_emergency_sync( );       /* with Magic SysRq Key support */
        spin_lock(&lru_list_lock);
        if (!write_some_buffers(0) || balance_dirty_state( ) < 0) {
            wait_for_some_buffers(0);
            interruptible_sleep_on(&bdflush_wait);
        }
    }

If the Linux kernel has been compiled with the Magic SysRq Key option, bdflush( ) checks whether the user has requested an emergency sync. If so, the function invokes do_emergency_sync( ) to execute fsync_dev( ) on all existing block devices, flushing all dirty buffers (see Section 14.2.4.3 later in this chapter).

Next, the function acquires the lru_list_lock spin lock, and invokes the write_some_buffers( ) function, which tries to activate block I/O write operations for up to 32 unlocked dirty buffers. Once the write operations have been activated, write_some_buffers( ) releases the lru_list_lock spin lock and returns 0 if fewer than 32 unlocked dirty buffers have been found; it returns a negative value otherwise.

If write_some_buffers( ) didn't find 32 buffers to flush, or the number of dirty or locked buffers falls below the percentage threshold given by the bdflush parameter nfract, the bdflush kernel thread goes to sleep. To do this, it first invokes the wait_for_some_buffers( ) function so that it sleeps until all I/O data transfers of the buffers in the BUF_LOCKED list terminate. During this time interval, the kernel thread is not woken up even if the kernel executes the wakeup_bdflush( ) function. Once data transfers terminate, the bdflush( ) function invokes interruptible_sleep_on( ) on the bdflush_wait wait queue to sleep until the next wakeup_bdflush( ) invocation.

14.2.4.2 The kupdate kernel thread

Since the bdflush kernel thread is usually activated only when there are too many dirty buffers or when more buffers are needed and available memory is scarce, some dirty buffers might stay in RAM for an arbitrarily long time before being flushed to disk. The kupdate kernel thread is thus introduced to flush the older dirty buffers. [5]

[5]

In an earlier version of Linux 2.2, the same task was achieved by means of the bdflush( ) system call, which was invoked every five seconds by a User Mode system process launched at system startup and which executed the /sbin/update program. In more recent kernel versions, the bdflush( ) system call is used only to allow users to modify the system parameters in the bdf_prm table.

As shown in Table 14-4, age_buffer is a time-out parameter that specifies the time for buffers to age before kupdate writes them to disk (usually 30 seconds), while the interval field of the bdf_prm table stores the delay in ticks between two activations of the kupdate kernel thread (usually five seconds). If this field is null, the kernel thread is normally stopped, and is activated only when it receives a SIGCONT signal.

When the kernel modifies the contents of some buffer, it sets the b_flushtime field of the corresponding buffer head to the time (in jiffies) when it should later be flushed to disk. The kupdate kernel thread selects only the dirty buffers whose b_flushtime field is smaller than the current value of jiffies.
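The aging rule can be modeled in a few lines. The sketch below rests on several assumptions that the text does not state: the flush time is taken to be the current tick count plus age_buffer, counter wraparound is ignored, and the tick rate of 100 ticks per second used in the usage note is assumed for the 80x86 architecture. The names mark_modified( ), is_old( ), and ticks_to_seconds( ) are hypothetical.

```c
#include <assert.h>

struct buffer_head {
    unsigned long b_flushtime;  /* tick after which the buffer counts as old */
    int dirty;                  /* stand-in for BH_Dirty */
};

/* Called when the buffer contents are modified at time `jiffies`.
 * Assumption: flush time = current time + age_buffer. */
static void mark_modified(struct buffer_head *bh, unsigned long jiffies,
                          unsigned long age_buffer)
{
    bh->dirty = 1;
    bh->b_flushtime = jiffies + age_buffer;
}

/* True if kupdate would select the buffer at time `jiffies`
 * (wraparound of the tick counter is ignored in this model). */
static int is_old(const struct buffer_head *bh, unsigned long jiffies)
{
    return bh->dirty && bh->b_flushtime < jiffies;
}

/* Convert ticks to whole seconds for a given tick rate. */
static unsigned long ticks_to_seconds(unsigned long ticks, unsigned long hz)
{
    assert(hz > 0);
    return ticks / hz;
}
```

At 100 ticks per second, the default age_buffer of 3000 ticks corresponds to 30 seconds and the default interval of 500 ticks to 5 seconds, which matches the "usually" values quoted above.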

The kupdate kernel thread runs the kupdate( ) function; it keeps executing the following endless loop:

    for (;;) {
        wait_for_some_buffers(0);
        if (bdf_prm.b_un.interval) {
            tsk->state = TASK_INTERRUPTIBLE;
            schedule_timeout(bdf_prm.b_un.interval);
        } else {
            tsk->state = TASK_STOPPED;
            schedule( );    /* wait for SIGCONT */
        }
        sync_old_buffers( );
    }

First of all, the kernel thread suspends itself until the I/O data transfers have been completed for all buffers in the BUF_LOCKED list. Then, if bdf_prm.b_un.interval is not null, the thread goes to sleep for the specified number of ticks (see Section 6.6.2); otherwise, the thread stops itself until a SIGCONT signal is received (see Section 10.1).

The core of the kupdate( ) function consists of the sync_old_buffers( ) function. The operations to be performed are very simple for standard filesystems used with Unix; all the function has to do is write dirty buffers to disk. However, some nonnative filesystems introduce complexities because they store their superblock or inode information in complicated ways. sync_old_buffers( ) executes the following steps:

1. Acquires the big kernel lock.

2. Invokes sync_unlocked_inodes( ), which scans the superblocks of all currently mounted filesystems and, for each superblock, the list of dirty inodes to which the s_dirty field of the superblock object points. For each inode, the function flushes the dirty pages that belong to memory mappings of the corresponding file (see Section 15.2.5), then invokes the write_inode superblock operation if it is defined. (The write_inode method is defined only by non-Unix filesystems that do not store all the inode data inside a single disk block; the MS-DOS filesystem is one example.)

3. Invokes sync_supers( ), which takes care of superblocks used by filesystems that do not store all the superblock data in a single disk block (an example is Apple Macintosh's HFS). The function accesses the superblocks list of all currently mounted filesystems (see Section 12.4). It then invokes, for each superblock, the corresponding write_super superblock operation, if one is defined (see Section 12.2.1). The write_super method is not defined for any Unix filesystem.

4. Releases the big kernel lock.

5. Starts a loop consisting of the following steps:

   a. Gets the lru_list_lock spin lock.

   b. Gets the bh pointer to the first buffer head in the BUF_DIRTY list.

   c. If the pointer is null or if the b_flushtime buffer head field has a value greater than jiffies (young buffer), releases the lru_list_lock spin lock and terminates.

   d. Invokes write_some_buffers( ), which tries to activate block I/O write operations for up to 32 unlocked dirty buffers in the BUF_DIRTY list. Once the write activations have been performed, write_some_buffers( ) releases the lru_list_lock spin lock and returns 0 if fewer than 32 unlocked dirty buffers have been found; it returns a negative value otherwise.

   e. If write_some_buffers( ) flushed to disk exactly 32 unlocked dirty buffers, jumps to Step 5a; otherwise, terminates the execution.
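The loop in Step 5 can be pictured with a small user-space model. This is only a sketch under stated assumptions, not kernel code: struct toy_bh, first_dirty( ), toy_write_some_buffers( ), and toy_sync_old_buffers( ) are hypothetical names, an array stands in for the BUF_DIRTY list, and locking is left out entirely.

```c
#include <assert.h>
#include <stddef.h>

#define BATCH 32 /* at most 32 buffers are handled per call, as in Step 5d */

/* Hypothetical stand-in for a buffer head: only what the loop needs. */
struct toy_bh {
    unsigned long flushtime; /* plays the role of b_flushtime */
    int dirty;               /* 1 while the buffer is still in the dirty list */
};

/* Index of the first dirty buffer, or n if the list holds none. */
static size_t first_dirty(const struct toy_bh *list, size_t n)
{
    size_t i = 0;
    while (i < n && !list[i].dirty)
        i++;
    return i;
}

/*
 * Toy write_some_buffers( ): "writes" up to BATCH dirty buffers.
 * Returns 0 if fewer than BATCH were found, a negative value otherwise.
 */
static int toy_write_some_buffers(struct toy_bh *list, size_t n)
{
    int written = 0;
    for (size_t i = 0; i < n && written < BATCH; i++) {
        if (list[i].dirty) {
            list[i].dirty = 0;
            written++;
        }
    }
    return written == BATCH ? -1 : 0;
}

/* Toy version of Step 5; returns how many batches were started. */
static int toy_sync_old_buffers(struct toy_bh *list, size_t n,
                                unsigned long now)
{
    int batches = 0;

    assert(list != NULL || n == 0);
    for (;;) {
        size_t i = first_dirty(list, n);
        if (i == n || list[i].flushtime > now)
            break; /* Step 5c: empty list, or first buffer is still young */
        batches++;
        if (toy_write_some_buffers(list, n) == 0)
            break; /* Step 5e: fewer than 32 buffers were flushed */
    }
    return batches;
}
```

Note that only the first buffer's flush time is tested before each batch; once a batch starts, up to 32 dirty buffers are written regardless of their individual flush times, which mirrors the description of Steps 5c and 5d.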

14.2.4.3 The sync( ), fsync( ), and fdatasync( ) system calls

Three different system calls are available to user applications to flush dirty buffers to disk:

sync( )
    Usually issued before a shutdown, since it flushes all dirty buffers to disk

fsync( )
    Allows a process to flush all blocks that belong to a specific open file to disk

fdatasync( )
    Very similar to fsync( ), but doesn't flush the inode block of the file

The core of the sync( ) system call is the fsync_dev( ) function, which performs the following actions:

1. Invokes sync_buffers( ), which essentially executes the following code:

       do {
           spin_lock(&lru_list_lock);
       } while (write_some_buffers(0));
       run_task_queue(&tq_disk);

   As you see, the function keeps invoking write_some_buffers( ) as long as each call finds 32 unlocked, dirty buffers; the loop ends when a call finds fewer than 32 and returns 0. Then, the block device drivers are unplugged to start real I/O data transfers (see Section 13.4.6.2).

2. Acquires the big kernel lock.

3. Invokes sync_inodes( ), which is quite similar to the sync_unlocked_inodes( ) function discussed in the previous section.

4. Invokes sync_supers( ) to write the dirty superblocks to disk, if necessary, by using the write_super methods (see earlier in this section).

5. Releases the big kernel lock.

6. Invokes sync_buffers( ) once again. This time, it waits until all locked buffers have been transferred.

The fsync( ) system call forces the kernel to write to disk all dirty buffers that belong to the file specified by the fd file descriptor parameter (including the buffer containing its inode, if necessary). The system service routine derives the address of the file object and then invokes the fsync method. Usually, this method simply invokes the fsync_inode_buffers( ) function, which scans the two lists of dirty buffers of the inode object (see Section 14.2.1.3 earlier in this chapter), and invokes ll_rw_block( ) on each element present in the lists. The function then suspends the calling process until all dirty buffers of the file have been written to disk by invoking wait_on_buffer( ) on each locked buffer. Moreover, the service routine of the fsync( ) system call flushes the dirty pages that belong to the memory mapping of the file, if any (see Section 15.2.5).

The fdatasync( ) system call is very similar to fsync( ), but is meant to write to disk only the buffers that contain the file's data, not those that contain inode information. Since Linux 2.4 does not have a specific file method for fdatasync( ), this system call uses the fsync method and is thus identical to fsync( ).
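From the application side, these system calls are used as in the following minimal sketch. The helper name write_durably( ) is a hypothetical choice made for illustration; open( ), write( ), fsync( ), and close( ) are the standard POSIX calls. A short write is simply treated as a failure to keep the example small.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Write a string to path and ask the kernel to flush it to disk
 * before returning. Returns 0 on success, -1 on any failure.
 */
int write_durably(const char *path, const char *text)
{
    assert(path != NULL && text != NULL);

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    size_t len = strlen(text);
    ssize_t n = write(fd, text, len);
    int ok = (n == (ssize_t)len);

    /*
     * fsync( ) flushes the dirty buffers of this file, including the
     * buffer that holds its inode. fdatasync(fd) could be called here
     * instead when only the file's data must reach the disk.
     */
    if (ok && fsync(fd) < 0)
        ok = 0;

    if (close(fd) < 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

Calling sync( ) instead would flush every dirty buffer in the system, which is why it is usually reserved for shutdown-like situations rather than for making a single file durable.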


Chapter 15. Accessing Files

Accessing a file is a complex activity that involves the VFS abstraction (Chapter 12), handling block devices (Chapter 13), and the use of disk caches (Chapter 14). This chapter shows how the kernel builds on all those facilities to carry out file reads and writes. The topics covered in this chapter apply both to regular files stored in disk-based filesystems and to block device files; these two kinds of files will be referred to simply as "files."

The stage we are working at in this chapter starts after the proper read or write method of a particular file has been called (as described in Chapter 12). We show here how each read ends with the desired data delivered to a User Mode process and how each write ends with data marked ready for transfer to disk. The rest of the transfer is handled by the facilities described in Chapter 13 and Chapter 14.

In particular, in Section 15.1, we describe how files are accessed by means of the read( ) and write( ) system calls. When a process reads from a file, data is first moved from the disk itself to a set of buffers in the kernel's address space. This set of buffers is included in a set of pages in the page cache (see Section 13.4.8.2). Next, the pages are copied into the process's user address space. A write is basically the opposite, although some stages are different from reads in important ways.

In Section 15.2, we discuss how the kernel allows a process to directly map a regular file into its address space, because that activity also has to deal with pages in kernel memory. Finally, in Section 15.3, we discuss the kernel support for self-caching applications.


15.1 Reading and Writing a File

Section 12.6.2 described how the read( ) and write( ) system calls are implemented. The corresponding service routines end up invoking the file object's read and write methods, which may be filesystem-dependent. For disk-based filesystems, these methods locate the physical blocks that contain the data being accessed, and activate the block device driver to start the data transfer.

Reading a file is page-based: the kernel always transfers whole pages of data at once. If a process issues a read( ) system call to get a few bytes, and that data is not already in RAM, the kernel allocates a new page frame, fills the page with the suitable portion of the file, adds the page to the page cache, and finally copies the requested bytes into the process address space. For most filesystems, reading a page of data from a file is just a matter of finding what blocks on disk contain the requested data. Once this is done, the kernel can use one or more page I/O operations to fill the pages. The read method of most filesystems is implemented by a common function named generic_file_read( ).

Write operations on disk-based files are slightly more complicated to handle, since the file size could change, and therefore the kernel might allocate or release some physical blocks on the disk. Of course, how this is precisely done depends on the filesystem type. However, many disk-based filesystems implement their write methods by means of a common function named generic_file_write( ). Examples of such filesystems are Ext2, System V/Coherent/Xenix, and Minix. On the other hand, several other filesystems, such as journaling and network filesystems, implement the write method by means of custom functions.

15.1.1 Reading from a File

The read method of the regular files that belong to almost all disk-based filesystems, as well as the read method of any block device file, is implemented by the generic_file_read( ) function. It acts on the following parameters:

filp
    Address of the file object

buf
    Linear address of the User Mode memory area where the characters read from the file must be stored

count
    Number of characters to be read

ppos
    Pointer to a variable that stores the offset from which reading must start (usually the f_pos field of the filp file object)

As a first step, the function checks whether the O_DIRECT flag of the file object is set. If so, the read access should bypass the page cache; we discuss this special case in Section 15.3 later in this chapter. Let's assume that the O_DIRECT flag is not set. The function invokes access_ok( ) to verify that the buf and count parameters received from the system call service routine sys_read( ) are correct, and returns the EFAULT error code if they aren't (see Section 9.2.4).

If everything is ok, generic_file_read( ) allocates a read operation descriptor, namely a data structure of type read_descriptor_t that stores the current status of the ongoing file read operation. The fields of this descriptor are shown in Table 15-1.

Table 15-1. The fields of the read operation descriptor

Type      Field      Description
size_t    written    How many bytes have been transferred
size_t    count      How many bytes are yet to be transferred
char *    buf        Current position in User Mode buffer
int       error      Error code of the read operation (0 for no error)
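The fields in Table 15-1 map directly onto a C structure. The sketch below declares it with the types and names given in the table; the read_desc_init( ) helper is a hypothetical addition that shows the state one would expect at the start of a read (nothing transferred yet, count bytes still to go), not a function taken from the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Read operation descriptor, following Table 15-1. */
typedef struct {
    size_t written; /* how many bytes have been transferred */
    size_t count;   /* how many bytes are yet to be transferred */
    char  *buf;     /* current position in the User Mode buffer */
    int    error;   /* error code of the read operation (0 for no error) */
} read_descriptor_t;

/* Hypothetical helper: descriptor state before any byte is copied. */
static read_descriptor_t read_desc_init(char *buf, size_t count)
{
    assert(buf != NULL || count == 0);

    read_descriptor_t desc = { 0, count, buf, 0 };
    return desc;
}
```

As the read proceeds, written grows and count shrinks by the same amount, while buf advances through the destination buffer.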

Then the function invokes do_generic_file_read( ), passing to it the file object pointer filp, the pointer to the file offset ppos, the address of the just allocated read operation descriptor desc, and the address of the file_read_actor( ) function (see later). The do_generic_file_read( ) function performs the following actions:[1]

[1] As usual, for the sake of simplicity, we do not discuss how errors and anomalous conditions are handled.

1. Gets the address_space object corresponding to the file being read; its address is stored in filp->f_dentry->d_inode->i_mapping.

2. Gets the inode object that owns the address space; its address is stored in the host field of the address_space object. Notice that this object could be different from the inode pointed to by filp->f_dentry->d_inode (see Section 13.4.1).

3. Considers the file as subdivided in pages of data (4,096 bytes per page) and derives, from the file pointer *ppos, the logical number index of the page including the first requested byte. Also stores in offset the displacement inside the page of the first requested byte.

4. Checks whether the file pointer is inside the read-ahead window of the file. We defer discussing read-ahead until Section 15.1.2.

5. Starts a cycle to read all pages that include the requested desc->count bytes. During a single iteration, the function transfers a page of data by performing the following substeps:

   a. If index*4096+offset exceeds the file size stored in the i_size field of the inode object, it exits from the cycle and goes to Step 6.

   b. Looks up the page cache to find the page that stores the requested data. Remember that the page cache is essentially a hash table indexed by the address of the address_space object and the displacement of the page inside the file (index).

   c. If the page is not found inside the page cache, allocates a new page frame and inserts it into the page cache by invoking add_to_page_cache( ) (see Section 14.1.3). Remember that the PG_uptodate flag of the page is cleared, while the PG_locked flag is set. The function jumps to Step 5h.

   d. Here the page has been found in the page cache. The function increments the usage counter of the page descriptor.

   e. Checks the PG_uptodate flag of the page; if it is set, the data stored in the page is up-to-date. The function jumps to Step 5j.

   f. Invokes generic_file_readahead( ) to consider activating further read-ahead operations on the file. As we'll see in Section 15.1.2, this function could trigger I/O data transfers for some other blocks in the page. However, we may safely ignore the issue right now.

   g. The data on the page is not valid, so it must be read from disk. The function gains exclusive access to the page by setting the PG_locked flag. Of course, the page might be already locked if a previously started I/O data transfer is not yet terminated; in this case, it sleeps until the page is unlocked, and then checks the PG_uptodate flag again in case another data transfer has performed the necessary read. If the flag is now set to 1, the function jumps to Step 5j. Otherwise, the function continues to perform the read.

   h. Invokes the readpage method of the address_space object of the file. The corresponding function takes care of activating the I/O data transfer from the disk to the page. We discuss later what this function does for regular files and block device files.

   i. Checks the PG_uptodate flag of the page. If the I/O data transfer is not already completed, the flag is still cleared, so the function invokes again the generic_file_readahead( ) function and waits until the I/O data transfer completes.

   j. The page contains up-to-date data. The function invokes generic_file_readahead( ) to consider activating further read-ahead operations on the file. As we'll see in Section 15.1.2, this function could trigger I/O data transfers for some other blocks in the page.

   k. Invokes mark_page_accessed( ) to set the PG_referenced flag, which denotes that the page is actively used and should not be swapped out (see Chapter 16). This is done only if the page has been explicitly requested by the user (the kernel is not performing read-ahead).

   l. Now it is time to copy the data on the page into the User Mode buffer. To do this, do_generic_file_read( ) invokes the file_read_actor( ) function, whose address has been passed as a parameter of the function. In turn, file_read_actor( ) performs the following steps:

      a. Invokes kmap( ), which establishes a permanent kernel mapping for the page if it is in high memory (see Section 7.1.6).

      b. Invokes __copy_to_user( ), which copies the data on the page into the User Mode address space (see Section 9.2.5). Notice that this operation might block the process.

      c. Invokes kunmap( ) to release any permanent kernel mapping of the page.

      d. Updates the count, written, and buf fields of the read_descriptor_t descriptor.

   m. Updates the index and offset local variables according to the number of bytes effectively transferred into the User Mode buffer.

   n. Decrements the page descriptor usage counter.

   o. If the count field of the read_descriptor_t descriptor is not null and all requested bytes in the page have been successfully transferred into the User Mode address space, continues the loop with the next page of data in the file, jumping to Step 5a.

6. Assigns to *ppos the value index*4096+offset, thus storing the next position where a read is to occur for a future invocation of this function.

7. Sets the f_reada field of the file object to 1 to record the fact that data is being read sequentially from the file (see Section 15.1.2 later in this chapter).

8. Invokes update_atime( ) to store the current time in the i_atime field of the file's inode and to mark the inode as dirty.
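The index and offset bookkeeping of Steps 3, 5a, 5m, and 6 can be tried out in user space. The following is a simplified model under stated assumptions, not the kernel's code: toy_read( ) and TOY_PAGE_SIZE are hypothetical names, the "file" is just an array in memory, memcpy( ) stands in for file_read_actor( ) and __copy_to_user( ), and the page cache, locking, and read-ahead are all omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096UL

/*
 * Copy up to count bytes of an in-memory "file" of file_size bytes,
 * starting at *ppos, into buf, one page-sized chunk per iteration.
 * Returns the number of bytes copied and advances *ppos.
 */
static size_t toy_read(const char *file, size_t file_size,
                       char *buf, size_t count, unsigned long *ppos)
{
    assert(file != NULL && buf != NULL && ppos != NULL);

    unsigned long index  = *ppos / TOY_PAGE_SIZE; /* Step 3: page number */
    unsigned long offset = *ppos % TOY_PAGE_SIZE; /* Step 3: displacement */
    size_t written = 0;

    while (count > 0) {
        unsigned long pos = index * TOY_PAGE_SIZE + offset;
        if (pos >= file_size)
            break; /* Step 5a: nothing left to read in the file */

        /* Bytes to take from this page: bounded by the page end,
         * the end of the file, and the bytes still requested. */
        size_t nr = TOY_PAGE_SIZE - offset;
        if (nr > file_size - pos)
            nr = file_size - pos;
        if (nr > count)
            nr = count;

        /* Stand-in for file_read_actor( ): copy and update counters. */
        memcpy(buf + written, file + pos, nr);
        written += nr;
        count   -= nr;

        /* Step 5m: advance index and offset by the bytes transferred. */
        offset += nr;
        index  += offset / TOY_PAGE_SIZE;
        offset %= TOY_PAGE_SIZE;
    }

    *ppos = index * TOY_PAGE_SIZE + offset; /* Step 6 */
    return written;
}
```

A request that straddles a page boundary (for instance, 200 bytes starting at offset 4,000) is served in two iterations: 96 bytes from the end of page 0, then 104 bytes from the start of page 1.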

15.1.1.1 The readpage method for regular files

As we saw in the previous section, the readpage method is used repeatedly by do_generic_file_read( ) to read individual pages from disk into memory. The readpage method of the address_space object stores the address of the function that effectively activates the I/O data transfer from the physical disk to the page cache. For regular files, this field typically points to a wrapper that invokes the block_read_full_page( ) function. For instance, the readpage method of the Ext2 filesystem is implemented by the following function:

    int ext2_readpage(struct file *file, struct page *page)
    {
        return block_read_full_page(page, ext2_get_block);
    }

The wrapper is needed because the block_read_full_page( ) function receives as parameters the descriptor page of the page to be filled and the address get_block of a function that helps block_read_full_page( ) find the right block. This function translates the block numbers relative to the beginning of the file into logical block numbers relative to positions of the block in the disk partition (for an example, see Chapter 17). Of course, the latter parameter depends on the type of filesystem to which the regular file belongs; in the previous example, the parameter is the address of the ext2_get_block( ) function.

The block_read_full_page( ) function starts a page I/O operation on the buffers included in the page. It allocates any necessary buffer heads, finds the buffers on disk using the get_block method described earlier, and transfers the data. Specifically, it performs the following steps:

1. Checks the page->buffers field; if it is NULL, invokes create_empty_buffers( ) to allocate asynchronous buffer heads for all buffers included in the page (see Section 13.4.8.2). The address of the buffer head for the first buffer in the page is stored in the page->buffers field. The b_this_page field of each buffer head points to the buffer head of the next buffer in the page.

2. Derives from the file offset relative to the page (page->index field) the file block number of the first block in the page.

3. For each buffer head of the buffers in the page, performs the following substeps:

   a. If the BH_Uptodate flag is set, skips the buffer and continues with the next buffer in the page.

   b. If the BH_Mapped flag is not set, invokes the filesystem-dependent function whose address has been passed as a parameter called get_block. The function looks in the on-disk data structures of the filesystem and finds the logical block number of the buffer (relative to the beginning of the disk partition rather than the beginning of the regular file). The filesystem-dependent function stores this number in the b_blocknr field of the corresponding buffer head, and sets its BH_Mapped flag. In rare cases, the filesystem-dependent function might not find the block, even if the block belongs to the regular file, because the application might have left a hole in that location (see Section 17.6.4). In this case, block_read_full_page( ) fills the buffer with 0's, sets the BH_Uptodate flag of the corresponding buffer head, and continues with the next buffer in the page.

   c. Tests again the BH_Uptodate flag because the filesystem-dependent function could have triggered a block I/O operation that updated the buffer. If BH_Uptodate is set, continues with the next buffer in the page.

   d. Stores the address of the buffer head in the arr local array, and continues with the next buffer in the page.

4. Now the arr local array stores the addresses of the buffer heads that correspond to the buffers whose content is not up-to-date. If the array is empty, all buffers in the page are valid. So the function sets the PG_uptodate flag of the page descriptor, unlocks the page, and terminates.

5. The arr local array is not empty. For each buffer head in the array, block_read_full_page( ) performs the following substeps:

   a. Sets the BH_Lock flag. If the flag was already set, the function waits until the buffer is released.

   b. Sets the b_end_io field of the buffer head to the address of the end_buffer_io_async( ) function (see Section 13.4.8.2).

   c. Sets the BH_Async flag of the buffer head.

6. For each buffer head in the arr local array, invokes the submit_bh( ) function on it, specifying the operation type READ. As we saw in Section 13.4.6, this function triggers the I/O data transfer of the corresponding block.
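Steps 2 through 4 can be modeled in user space as follows. This is a sketch under stated assumptions rather than the kernel's implementation: struct toy_bh, toy_get_block_t, toy_map_with_hole( ), first_file_block( ), and collect_stale_buffers( ) are hypothetical names; the shift in first_file_block( ) assumes 4,096-byte pages and a block size that is a power of two; and Step 3c (retesting BH_Uptodate after the callback) is omitted because the toy callback never performs I/O.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SHIFT 12 /* 4,096-byte pages */

/* Hypothetical buffer head holding only the state used below. */
struct toy_bh {
    int uptodate;          /* plays the role of BH_Uptodate */
    int mapped;            /* plays the role of BH_Mapped */
    unsigned long blocknr; /* plays the role of b_blocknr */
};

/* Hypothetical get_block callback: returns 1 and stores the disk block
 * number if the file block is mapped, or returns 0 if it is a hole. */
typedef int (*toy_get_block_t)(unsigned long file_block,
                               unsigned long *disk_block);

/* Example callback: file block n lives at disk block 1000 + n,
 * except file block 5, which is a hole. */
static int toy_map_with_hole(unsigned long file_block,
                             unsigned long *disk_block)
{
    if (file_block == 5)
        return 0;
    *disk_block = 1000 + file_block;
    return 1;
}

/* Step 2: file block number of the first block in the page. */
static unsigned long first_file_block(unsigned long page_index,
                                      unsigned int blocksize_bits)
{
    assert(blocksize_bits <= TOY_PAGE_SHIFT);
    return page_index << (TOY_PAGE_SHIFT - blocksize_bits);
}

/* Steps 3 and 4: store in arr the buffer heads that still need to be
 * read from disk, and return how many there are. */
static size_t collect_stale_buffers(struct toy_bh *bhs, size_t nbufs,
                                    unsigned long iblock,
                                    toy_get_block_t get_block,
                                    struct toy_bh **arr)
{
    size_t nr = 0;

    for (size_t i = 0; i < nbufs; i++, iblock++) {
        struct toy_bh *bh = &bhs[i];

        if (bh->uptodate)
            continue; /* Step 3a: already valid */
        if (!bh->mapped) { /* Step 3b: ask the filesystem */
            unsigned long disk_block;
            if (!get_block(iblock, &disk_block)) {
                /* Hole: the real function also zero-fills the buffer. */
                bh->uptodate = 1;
                continue;
            }
            bh->blocknr = disk_block;
            bh->mapped = 1;
        }
        arr[nr++] = bh; /* Step 3d */
    }
    return nr;
}
```

A return value of 0 corresponds to the case in Step 4 where every buffer in the page is already valid; otherwise the entries of arr are the buffers that Steps 5 and 6 would lock and submit.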

15.1.1.2 The readpage method for block device files

In Section 13.2.3 and Section 13.4.5.2, we discussed how the kernel handles requests to open a block device file. We saw how the kernel allocates a descriptor of type block_device for any newly opened device driver and inserts it into a hash table. The bd_inode field of the descriptor points to a block device inode that belongs to the bdev special filesystem (see Section 13.4.1). Each I/O operation on the block device refers to this inode, rather than to the inode of the block device file that was specified in the open( ) system call. (Remember that different device files might refer to the same block device.)

Block devices use an address_space object that is stored in the i_data field of the corresponding block device inode. Unlike regular files, whose readpage method in the address_space object depends on the filesystem type to which the file belongs, the readpage method of block device files is always the same. It is implemented by the blkdev_readpage( ) function, which calls block_read_full_page( ):

    int blkdev_readpage(struct file *file, struct page *page)
    {
        return block_read_full_page(page, blkdev_get_block);
    }

As you see, the function is once again a wrapper for the block_read_full_page( ) function described in the previous section. This time the second parameter points to a function that must translate the file block number relative to the beginning of the file into a logical block number relative to the beginning of the block device. For block device files, however, the two numbers coincide; therefore, the blkdev_get_block( ) function performs the following steps:

1. Checks whether the number of the first block in the page exceeds the size of the block device (stored in blk_size[MAJOR(inode->i_rdev)][MINOR(inode->i_rdev)]; see Section 13.4.2). If so, returns the error code -EIO.

2. Sets the b_dev field of the buffer head to inode->i_rdev.

3. Sets the b_blocknr field of the buffer head to the file block number of the first block in the page.

4. Sets the BH_Mapped flag of the buffer head to state that the b_dev and b_blocknr fields of the buffer head are significant.
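Because the file block number and the device block number coincide, the mapping function is essentially an identity mapping guarded by a bounds check. The following user-space sketch illustrates the four steps; struct toy_bh and toy_blkdev_get_block( ) are hypothetical names, the device identifier is reduced to a plain integer, and the device size is passed in as a number of blocks instead of being looked up in blk_size.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical buffer head with only the fields touched below. */
struct toy_bh {
    int dev;               /* plays the role of b_dev */
    unsigned long blocknr; /* plays the role of b_blocknr */
    int mapped;            /* plays the role of BH_Mapped */
};

/*
 * Map block iblock of a device holding dev_blocks blocks.
 * Returns 0 on success or -EIO if the block lies beyond the device.
 */
static int toy_blkdev_get_block(int dev, unsigned long dev_blocks,
                                unsigned long iblock, struct toy_bh *bh)
{
    assert(bh != NULL);

    if (iblock >= dev_blocks)
        return -EIO;      /* Step 1: beyond the end of the device */
    bh->dev = dev;        /* Step 2 */
    bh->blocknr = iblock; /* Step 3: the two block numbers coincide */
    bh->mapped = 1;       /* Step 4: b_dev and b_blocknr are now valid */
    return 0;
}
```

Compare this with a regular file, where the equivalent callback has to consult the filesystem's on-disk structures to turn a file block number into a block number within the partition.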

15.1.2 Read-Ahead of Files

Many disk accesses are sequential. As we shall see in Chapter 17, regular files are stored on disk in large groups of adjacent sectors, so that they can be retrieved quickly with few moves of the disk heads. When a program reads or copies a file, it often accesses it sequentially, from the first byte to the last one. Therefore, many adjacent sectors on disk are likely to be fetched in several I/O operations.

Read-ahead is a technique that consists of reading several adjacent pages of data of a regular file or block device file, before they are actually requested. In most cases, read-ahead significantly enhances disk performance, since it lets the disk controller handle fewer commands, each of which refers to a larger chunk of adjacent sectors. Moreover, it improves system responsiveness. A process that is sequentially reading a file does not usually wait for the requested data because it is already available in RAM.

However, read-ahead is of no use to random accesses to files; in this case, it is actually detrimental since it tends to waste space in the page cache with useless information. Therefore, the kernel stops read-ahead when it determines that the most recently issued I/O access is not sequential to the previous one.

Read-ahead of files requires a sophisticated algorithm for several reasons:

● Since data is read page by page, the read-ahead algorithm does not have to consider the offsets inside the page, but only the positions of the accessed pages inside the file. A series of accesses to pages of the same file is considered sequential if the related pages are close to each other. We'll define the word "close" more precisely in a moment.

● Read-ahead must be restarted from scratch when the current access is not sequential with respect to the previous one (random access).

● Read-ahead should be slowed down or even stopped when a process keeps accessing the same pages over and over again (only a small portion of the file is being used).

● If necessary, the read-ahead algorithm must activate the low-level I/O device driver to make sure that the new pages will ultimately be read.

The read-ahead algorithm identifies a set of pages that correspond to a contiguous portion of the file as the read-ahead window. If the next read operation issued by a process falls inside this set of pages, the kernel considers the file access "sequential" to the previous one. The read-ahead window consists of pages requested by the process or read in advance by the kernel and included in the page cache. The read-ahead window always includes the pages requested in the last read-ahead operation; they are called the read-ahead group. If the next operation issued by a process falls inside the read-ahead group, the kernel might read in advance some of the pages following the read-ahead window just to ensure that the kernel will be "ahead" of the reading process. Not all the pages in the read-ahead window or group are necessarily up to date. They are invalid (i.e., their PG_uptodate flags are cleared) if their transfer from disk is not yet completed.

The file object includes the following fields related to read-ahead:

f_raend
    Position of the first byte after the read-ahead group and the read-ahead window

f_rawin
    Length in bytes of the current read-ahead window

f_ralen
    Length in bytes of the current read-ahead group

f_ramax
    Maximum number of characters to get in the next read-ahead operation

f_reada
    Flag specifying whether the file pointer has been set explicitly by a lseek( ) system call (if value is 0) or implicitly by a previous read( ) system call (if value is 1)

When a file is opened, all these fields are set to 0. Figure 15-1 illustrates how some of the fields are used to delimit the read-ahead window and the read-ahead group.

Figure 15-1. Read-ahead window and read-ahead group
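The way f_raend, f_rawin, and f_ralen delimit the window and the group can be sketched in a few lines of C. This is only a user-space model, not kernel code: the struct ra_state type and the in_window( ) and in_group( ) helpers are hypothetical names, and every quantity is counted in pages (an assumption made to keep the arithmetic short).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the read-ahead fields of a file object.
 * Assumption: all values are expressed in pages. */
struct ra_state {
    unsigned long f_raend;  /* first page after the window and the group */
    unsigned long f_rawin;  /* length of the read-ahead window */
    unsigned long f_ralen;  /* length of the read-ahead group */
};

/* The window covers the pages [f_raend - f_rawin, f_raend). */
bool in_window(const struct ra_state *ra, unsigned long page)
{
    assert(ra->f_rawin <= ra->f_raend);
    return ra->f_rawin != 0 &&
           page >= ra->f_raend - ra->f_rawin &&
           page < ra->f_raend;
}

/* The group is the tail of the window: [f_raend - f_ralen, f_raend). */
bool in_group(const struct ra_state *ra, unsigned long page)
{
    assert(ra->f_ralen <= ra->f_rawin && ra->f_rawin <= ra->f_raend);
    return ra->f_ralen != 0 &&
           page >= ra->f_raend - ra->f_ralen &&
           page < ra->f_raend;
}
```

With f_raend = 12, f_rawin = 8, and f_ralen = 4, the window spans pages 4 through 11 and the group spans pages 8 through 11. A freshly opened file has all fields set to 0, so both sets are empty.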

The kernel distinguishes two kinds of read-ahead operations:

Synchronous read-ahead operation
    Performed whenever a read access falls outside the current read-ahead window of a file. The synchronous read-ahead operation usually affects all pages requested by the user in the read operation plus one. After the operation, the read-ahead window coincides with the read-ahead group (see Figure 15-2).

Asynchronous read-ahead operation
    Performed whenever a read access falls inside the current read-ahead group of a file. The asynchronous read-ahead operation usually tries to shift forward and to enlarge the read-ahead window of the file by reading from disk twice as many pages as the length of the previous read-ahead group. The new read-ahead window spans the old read-ahead group and the new one (see Figure 15-2).

Figure 15-2. Read-ahead group and window

To explain how read-ahead works, let's suppose a user issues a read( ) system call on a file. The do_generic_file_read( ) function checks whether the first page to be read falls inside the current read-ahead window of the file (Step 4 in Section 15.1.1). Three cases are considered:

● The first page to be read falls outside the current read-ahead window. The function sets the f_raend, f_ralen, f_ramax, and f_rawin fields of the file object to 0. Moreover, it disables asynchronous read-ahead operations by setting the reada_ok local variable to 0.

● The first page to be read falls inside the current read-ahead window. This means that the user is accessing the file sequentially. The function enables asynchronous read-ahead operations by setting the reada_ok local variable to 1.

● The current read-ahead window and groups are empty because the file was never accessed before; moreover, the first page to be read is the initial page of the file. In this special case, the function enables asynchronous read-ahead operations by setting the reada_ok local variable to 1.
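The three cases can be modeled as a single decision function. The sketch below is a simplified user-space illustration, not the kernel's code: struct ra_state and check_sequential( ) are hypothetical names, and the fields are assumed to be counted in pages.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the read-ahead fields (assumed to be in pages). */
struct ra_state {
    unsigned long f_raend, f_rawin, f_ralen, f_ramax;
};

/* Returns the value that the reada_ok local variable would take for a
 * read starting at page `index`; resets the state on a random access. */
int check_sequential(struct ra_state *ra, unsigned long index)
{
    bool never_read = ra->f_raend == 0 && ra->f_rawin == 0;
    bool inside;

    assert(ra->f_rawin <= ra->f_raend);
    inside = ra->f_rawin != 0 &&
             index >= ra->f_raend - ra->f_rawin &&
             index < ra->f_raend;

    if (inside)
        return 1;       /* second case: sequential access */
    if (never_read && index == 0)
        return 1;       /* third case: first read of the initial page */

    /* First case: random access, restart read-ahead from scratch. */
    ra->f_raend = ra->f_rawin = ra->f_ralen = ra->f_ramax = 0;
    return 0;
}
```

Note that a random access leaves the file in the same state as a newly opened one, so a later read of page 0 is again treated as the start of a sequential scan.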

The do_generic_file_read( ) function also adjusts the value stored in the f_ramax field of the file object, which represents the number of pages to be requested in the next read-ahead operation. Although its value is determined by the previous read-ahead operation on the file (if any), do_generic_file_read( ) ensures that f_ramax is always greater than the number of pages requested in the read( ) system call plus 1. Moreover, the function ensures that f_ramax is always greater than the value stored in the vm_min_readahead global variable (usually three pages) and smaller than a per-device upper bound. Each block device may define this upper bound by storing a value into the max_readahead array, which is indexed by the major and minor number of the device. If the driver does not specify an upper bound, the kernel uses the upper bound stored in the vm_max_readahead global variable (usually 31 pages). System administrators may tune the values in vm_min_readahead and vm_max_readahead by writing into the /proc/sys/vm/min-readahead and /proc/sys/vm/max-readahead files, respectively. [2]

[2] A special heuristic applies for read( ) system calls that affect only the first half of the initial page of the file. In this case, the do_generic_file_read( ) function sets the f_ramax field to 0. The idea is that if a user reads only a small number of characters at the beginning of the file, then she is not really interested in sequentially accessing the whole file, so read-ahead operations are useless.

We saw in the earlier section Section 15.1.1 that the do_generic_file_read( ) function invokes the generic_file_readahead( ) function several times, at least once for each page involved in the read request. The function receives as parameters the file and inode objects, the descriptor of the page currently considered by do_generic_file_read( ), and the value of the reada_ok flag, which enables or disables asynchronous read-ahead operations. To read ahead a page, the generic_file_readahead( ) function invokes page_cache_read( ), which looks up (and optionally inserts) the page in the page cache and then invokes the readpage method of the corresponding address_space object to request the I/O data transfer.
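The bounds that do_generic_file_read( ) places on f_ramax amount to a clamp. The following sketch illustrates the idea under stated assumptions: adjust_ramax( ) is a hypothetical helper, the lower bounds are read as "at least" rather than "strictly greater than", and the special case that sets f_ramax to 0 for short reads at the start of a file is left out.

```c
#include <assert.h>

/* Hypothetical helper: clamp f_ramax (in pages) for the next read-ahead.
 * min_ra and max_ra stand for vm_min_readahead and the per-device (or
 * vm_max_readahead) upper bound. */
unsigned long adjust_ramax(unsigned long f_ramax,
                           unsigned long pages_requested,
                           unsigned long min_ra,
                           unsigned long max_ra)
{
    unsigned long needed = pages_requested + 1;

    assert(min_ra <= max_ra);
    if (f_ramax < needed)
        f_ramax = needed;   /* cover the current request plus one page */
    if (f_ramax < min_ra)
        f_ramax = min_ra;   /* never fall below the global minimum */
    if (f_ramax > max_ra)
        f_ramax = max_ra;   /* never exceed the device's upper bound */
    return f_ramax;
}
```

With the usual bounds of 3 and 31 pages, a one-page read on a newly opened file yields an f_ramax of 3, while a value that has grown past 31 is cut back to 31.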

The overall scheme of generic_file_readahead( ) is shown in Figure 15-3. Basically, the function distinguishes two cases: synchronous and asynchronous. It checks the page descriptor passed as its parameter. If the PG_locked flag in this descriptor is set, the page is most likely still involved in the I/O data transfer triggered by the do_generic_file_read( ) function and any read-ahead must be synchronous. Otherwise, asynchronous read-ahead is possible. We examine the actions based on the PG_locked flag in the following sections.

Figure 15-3. Overall scheme of the generic_file_readahead( ) function

15.1.2.1 The accessed page is locked (synchronous read-ahead)

In this case, generic_file_readahead( ) may take three different courses of action:

● When the read access is not sequential with respect to the previous one (that is, either the read-ahead group is empty, or the accessed page is outside the read-ahead window) and f_ramax is not null, the function performs a synchronous read-ahead operation as follows:

    ❍ Reads f_ramax pages starting from the page following the accessed one.

    ❍ Sets the new read-ahead window and the new read-ahead group to contain the f_ramax pages just read and the page referenced by the do_generic_file_read( ) function.

    ❍ Doubles the value stored in f_ramax (but allows it to become no larger than the upper bound defined by the block device).

● When a synchronous read-ahead operation is likely to be performed, but the f_ramax field is set to 0, the generic_file_readahead( ) function resets the read-ahead window and the read-ahead group as follows:

    ❍ The read-ahead window includes just the accessed page, so its size is set to 1.

    ❍ The read-ahead group is set to be the same as the read-ahead window.

  Remember that do_generic_file_read( ) sets f_ramax to 0 when the user requests the first few characters of a file.

● If the accessed page falls inside the non-null read-ahead window, the function does nothing. Since the page is locked, the corresponding I/O data transfers are still to be finished, so it is pointless to start an additional read operation.
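The three courses of action for a locked page can be expressed as state updates on the read-ahead fields. The code below is a user-space sketch rather than the kernel's implementation: struct ra_state, grow_ramax( ), and sync_readahead( ) are hypothetical names, all fields are assumed to be in pages, and the actual disk reads are replaced by a return value giving the number of pages that would be read ahead.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the read-ahead fields (assumed to be in pages). */
struct ra_state {
    unsigned long f_raend, f_rawin, f_ralen, f_ramax;
};

/* Double f_ramax without exceeding the device's upper bound. */
static unsigned long grow_ramax(unsigned long f_ramax, unsigned long max_ra)
{
    unsigned long doubled = f_ramax * 2;
    return doubled > max_ra ? max_ra : doubled;
}

/* Locked-page case: returns how many pages would be read ahead. */
unsigned long sync_readahead(struct ra_state *ra, unsigned long index,
                             unsigned long max_ra)
{
    unsigned long to_read;
    bool sequential;

    assert(ra->f_ralen <= ra->f_rawin && ra->f_rawin <= ra->f_raend);
    sequential = ra->f_ralen != 0 &&
                 index >= ra->f_raend - ra->f_rawin &&
                 index < ra->f_raend;
    if (sequential)
        return 0;               /* I/O still in flight: do nothing */

    if (ra->f_ramax == 0) {
        /* Read-ahead disabled: window and group hold only this page. */
        ra->f_rawin = ra->f_ralen = 1;
        ra->f_raend = index + 1;
        return 0;
    }

    /* Read f_ramax pages after the accessed one; window == group. */
    to_read = ra->f_ramax;
    ra->f_rawin = ra->f_ralen = to_read + 1;
    ra->f_raend = index + 1 + to_read;
    ra->f_ramax = grow_ramax(ra->f_ramax, max_ra);
    return to_read;
}
```

Starting from an empty state with f_ramax equal to 3, an access to page 0 reads pages 1 through 3, leaves a four-page window and group ending at page 4, and raises f_ramax to 6.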

15.1.2.2 The accessed page is unlocked (asynchronous read-ahead)

If the page accessed by the caller do_generic_file_read( ) function is unlocked, the corresponding I/O data transfers have most likely finished. In this case, generic_file_readahead( ) may take two different courses of action:

● When several conditions are satisfied, the function performs an asynchronous read-ahead operation. These conditions are as follows: asynchronous read-ahead operations are enabled, the read-ahead group is not empty and the accessed page falls into it, and the f_ramax field is not null. The function does the following:

    ❍ Reads f_ramax+1 pages starting from f_raend

    ❍ Sets the new read-ahead window to include the previous read-ahead group and the f_ramax+1 pages just read

    ❍ Sets the new read-ahead group to include the f_ramax+1 pages just read

    ❍ Doubles the value stored in f_ramax (but allows it to become no larger than the upper bound defined by the block device)

● The function does nothing whenever it cannot start an asynchronous read-ahead operation: for instance, when the read operation is not sequential with respect to the previous one (the asynchronous read-ahead is disabled by do_generic_file_read( )), or when the access is sequential but the accessed page falls inside the read-ahead window and outside the read-ahead group (i.e., the process is lagging with respect to read-ahead).
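The asynchronous case can be modeled in the same style. Again this is a hedged user-space sketch: struct ra_state, grow_ramax( ), and async_readahead( ) are hypothetical names, the fields are assumed to be in pages, and the return value stands in for the pages that would actually be requested from disk.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the read-ahead fields (assumed to be in pages). */
struct ra_state {
    unsigned long f_raend, f_rawin, f_ralen, f_ramax;
};

/* Double f_ramax without exceeding the device's upper bound. */
static unsigned long grow_ramax(unsigned long f_ramax, unsigned long max_ra)
{
    unsigned long doubled = f_ramax * 2;
    return doubled > max_ra ? max_ra : doubled;
}

/* Unlocked-page case: returns how many pages would be read ahead. */
unsigned long async_readahead(struct ra_state *ra, unsigned long index,
                              int reada_ok, unsigned long max_ra)
{
    unsigned long to_read;
    bool in_group;

    assert(ra->f_ralen <= ra->f_rawin && ra->f_rawin <= ra->f_raend);
    in_group = ra->f_ralen != 0 &&
               index >= ra->f_raend - ra->f_ralen &&
               index < ra->f_raend;
    if (!reada_ok || !in_group || ra->f_ramax == 0)
        return 0;               /* conditions not met: do nothing */

    to_read = ra->f_ramax + 1;          /* pages read from f_raend on */
    ra->f_rawin = ra->f_ralen + to_read; /* old group plus new pages */
    ra->f_ralen = to_read;               /* new group: pages just read */
    ra->f_raend += to_read;
    ra->f_ramax = grow_ramax(ra->f_ramax, max_ra);
    return to_read;
}
```

For example, with a four-page window and group ending at page 4 and f_ramax equal to 6, an access to page 1 triggers a seven-page read; the window then covers pages 0 through 10 and the group covers pages 4 through 10. A subsequent access to page 2 lies in the window but outside the group, so nothing is read.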

15.1.3 Writing to a File

Recall that the write( ) system call involves moving data from the User Mode address space of the calling process into the kernel data structures, and then to disk. The write method of the file object permits each filesystem type to define a specialized write operation. In Linux 2.4, the write method of each disk-based filesystem is a procedure that basically identifies the disk blocks involved in the write operation, copies the data from the User Mode address space into some pages belonging to the page cache, and marks the buffers in those pages as dirty.

Several filesystems (such as Ext2) implement the write method of the file object by means of the generic_file_write( ) function, which acts on the following parameters:

file
    File object pointer

buf
    Address where the characters to be written into the file must be fetched

count
    Number of characters to be written

ppos
    Address of a variable storing the file offset from which writing must start

The function performs the following operations:

1. Verifies that the parameters count and buf are valid (they must refer to the User Mode address space); if not, returns the error code -EFAULT.

2. Determines the address inode of the inode object that corresponds to the file to be written (file->f_dentry->d_inode->i_mapping->host).

3. Acquires the semaphore inode->i_sem. Thanks to this semaphore, only one process at a time can issue a write( ) system call on the file.

4. If the O_APPEND flag of file->f_flags is on and the file is regular (not a block device file), sets *ppos to the end of the file so that all new data is appended to it.

5. Performs several checks on the size of the file. For instance, the write operation must not enlarge a regular file so much as to exceed the per-user limit stored in current->rlim[RLIMIT_FSIZE] (see Section 3.2.5) and the filesystem limit stored in inode->i_sb->s_maxbytes.

6. Stores the current time of day in the inode->i_mtime field (the time of last file write operation) and in the inode->i_ctime field (the time of last inode change), and marks the inode object as dirty.

7. Checks the value of the O_DIRECT flag of the file object. If it is set, the write operation bypasses the page cache. We discuss this case later in this chapter. In the rest of this section, we assume that O_DIRECT is not set.

8. Starts a cycle to update all the pages of the file involved in the write operation. During each iteration, performs the following substeps:

    a. Tries to find the page in the page cache. If it isn't there, allocates a free page and adds it to the page cache.

    b. Locks the page (that is, sets its PG_locked flag).

    c. Increments the page usage counter as a fail-safe mechanism.

    d. Invokes kmap( ) to get the starting linear address of the page (see Section 7.1.6).

    e. Invokes the prepare_write method of the address_space object of the inode (file->f_dentry->d_inode->i_mapping). The corresponding function takes care of allocating asynchronous buffer heads for the page and of reading some buffers from disk, if necessary. We'll discuss in subsequent sections what this function does for regular files and block device files.

    f. Invokes __copy_from_user( ) to copy the characters from the buffer in User Mode to the page.

    g. Invokes the commit_write method of the address_space object of the inode (file->f_dentry->d_inode->i_mapping). The corresponding function marks the underlying buffers as dirty so they are written to disk later. We discuss what this function does for regular files and block device files in the next two sections.

    h. Invokes kunmap( ) to release any permanent high-memory mapping established in Step 8d.

    i. Sets the PG_referenced flag of the page; it is used by the memory reclaiming algorithm described in Chapter 16.

    j. Clears the PG_locked flag, and wakes up any process that is waiting for the page to unlock.

    k. Decrements the page usage counter to undo the increment in Step 8c.

9. Now all pages of the file involved in the write operation have been handled. Updates the value of *ppos to point right after the last character written.

10. Checks whether the O_SYNC flag of the file is set. If so, invokes generic_osync_inode( ) to force the kernel to flush all dirty buffers of the page to disk, blocking the current process until the I/O data transfers terminate. In Version 2.4.18 of Linux, this function over-ices the cake because it flushes to disk all dirty buffers of the file, not just those belonging to the file portion just written.

11. Releases the inode->i_sem semaphore.

12. Returns the number of characters written into the file.
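The cycle in Step 8 walks the write request one page at a time. The sketch below shows only that bookkeeping in user space, assuming a 4 KB page; struct page_chunk and split_write( ) are hypothetical names, and none of the locking, mapping, or copying substeps is modeled. The offset and offset + bytes values of each chunk correspond to the from and to arguments later handed to prepare_write and commit_write.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_BYTES 4096UL   /* assumption: 4 KB pages, as on 80x86 */

/* Hypothetical description of the part of a write that hits one page. */
struct page_chunk {
    unsigned long index;    /* page index inside the file */
    unsigned long offset;   /* first byte written inside the page */
    size_t bytes;           /* number of bytes written in the page */
};

/* Split a write of `count` bytes at file offset `pos` into per-page
 * chunks. Stores at most `max` chunks in `out`; returns how many. */
size_t split_write(unsigned long long pos, size_t count,
                   struct page_chunk *out, size_t max)
{
    size_t n = 0;

    assert(out != NULL);
    while (count > 0 && n < max) {
        unsigned long offset = (unsigned long)(pos % PAGE_BYTES);
        size_t bytes = PAGE_BYTES - offset;

        if (bytes > count)
            bytes = count;  /* last page may be only partly written */
        out[n].index = (unsigned long)(pos / PAGE_BYTES);
        out[n].offset = offset;
        out[n].bytes = bytes;
        pos += bytes;       /* what *ppos tracks across iterations */
        count -= bytes;
        n++;
    }
    return n;
}
```

A 5,000-byte write starting at offset 4,000 touches three pages: 96 bytes at the end of page 0, all 4,096 bytes of page 1, and the first 808 bytes of page 2.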

15.1.3.1 The prepare_write and commit_write methods for regular files

The prepare_write and commit_write methods of the address_space object specialize the generic write operation implemented by generic_file_write( ) for regular files and block device files. Both of them are invoked once for every page of the file that is affected by the write operation.

Each disk-based filesystem defines its own prepare_write method. As with read operations, this method is simply a wrapper for a common function. For instance, the Ext2 filesystem implements the prepare_write method by means of the following function:

    int ext2_prepare_write(struct file *file, struct page *page,
                           unsigned from, unsigned to)
    {
        return block_prepare_write(page, from, to, ext2_get_block);
    }

The ext2_get_block( ) function was already mentioned in the earlier section Section 15.1.1; it translates the block number relative to the file into a logical block number, which represents the position of the data on the physical block device.

The block_prepare_write( ) function takes care of preparing the buffers and the buffer heads of the file's page by performing the following steps:

1. Checks the page->buffers field; if it is NULL, the function invokes create_empty_buffers( ) to allocate buffer heads for all buffers included in the page (see Section 13.4.8.2). The address of the buffer head for the first buffer in the page is stored in the page->buffers field. The b_this_page field of each buffer head points to the buffer head of the next buffer in the page.

2. For each buffer head relative to a buffer included in the page and affected by the write operation, the following is performed:

    a. If the BH_Mapped flag is not set, the function performs the following substeps:

        1. Invokes the filesystem-dependent function whose address was passed as a parameter. The function looks in the on-disk data structures of the filesystem and finds the logical block number of the buffer (relative to the beginning of the disk partition rather than the beginning of the regular file). The filesystem-dependent function stores this number in the b_blocknr field of the corresponding buffer head and sets its BH_Mapped flag. The filesystem-specific function could allocate a new physical block for the file (for instance, if the accessed block falls inside a "hole" of the regular file; see Section 17.6.4). In this case, it sets the BH_New flag.

        2. Checks the value of the BH_New flag; if it is set, invokes unmap_underlying_metadata( ) to make sure that the buffer cache does not include a dirty buffer referencing the same block on disk. [3] Moreover, if the write operation does not rewrite the whole buffer, the function fills it with 0's. Then considers the next buffer in the page.

    b. If the write operation does not rewrite the whole buffer and its BH_Uptodate flag is not set, the function invokes ll_rw_block( ) on the block to read its content from disk (see Section 13.4.6).

3. Blocks the current process until all read operations triggered in Step 2b have been completed.

[3] Although unlikely, this case might happen if another block in the same buffer page was previously accessed by means of a block I/O operation (which caused our buffer head to be inserted in the buffer cache; see Section 14.2.2), and if in addition a user wrote into our block by accessing the corresponding block device file, thus making it dirty.
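The test that decides which buffers must be read from disk before a partial overwrite can be shown in isolation. The sketch below is a simplified model and not the kernel routine: struct buf_plan and plan_buffers( ) are hypothetical names, a 4 KB page is assumed, and the BH_Mapped and BH_New handling of Step 2a is ignored.

```c
#include <assert.h>
#include <stdbool.h>

#define PAGE_BYTES 4096u    /* assumption: 4 KB pages */

/* Hypothetical per-buffer result of the preparation pass. */
struct buf_plan {
    bool affected;          /* buffer overlaps the range [from, to) */
    bool needs_read;        /* must be read from disk before the copy */
};

/* Classify every buffer of a page for a write covering [from, to).
 * uptodate[i] models the BH_Uptodate flag of the i-th buffer.
 * Returns the number of buffers that must be read from disk. */
unsigned plan_buffers(unsigned from, unsigned to, unsigned blocksize,
                      const bool *uptodate, struct buf_plan *plan)
{
    unsigned reads = 0;
    unsigned i;

    assert(blocksize != 0 && PAGE_BYTES % blocksize == 0);
    assert(from <= to && to <= PAGE_BYTES);

    for (i = 0; i < PAGE_BYTES / blocksize; i++) {
        unsigned start = i * blocksize;
        unsigned end = start + blocksize;
        bool partial = start < from || end > to;

        plan[i].affected = end > from && start < to;
        /* A fully overwritten buffer never needs its old content. */
        plan[i].needs_read = plan[i].affected && partial && !uptodate[i];
        if (plan[i].needs_read)
            reads++;
    }
    return reads;
}
```

With 1,024-byte blocks and a write covering bytes 512 through 2,047 of the page, the first buffer is only partly rewritten and must be read if it is not up to date, the second buffer is rewritten entirely and needs no read, and the last two buffers are untouched.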

Once the prepare_write method returns, the generic_file_write( ) function updates the page with the data stored in the User Mode address space. Next, it invokes the commit_write method of the address_space object. This method is implemented by the generic_commit_write( ) function for almost all disk-based filesystems.

The generic_commit_write( ) function performs the following steps:

1. Invokes the block_commit_write( ) function. In turn, this function considers all buffers in the page that are affected by the write operation; for each of them, it sets the BH_Uptodate and BH_Dirty flags and inserts the buffer head in the BUF_DIRTY list and in the list of dirty buffers of the inode (if it is not already in the list). The function also invokes the balance_dirty( ) function to keep the number of dirty buffers in the system bounded (see Section 14.2.4).

2. Checks whether the write operation enlarged the file. In this case, the function updates the i_size field of the file's inode and marks the inode object as dirty.
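The size check in Step 2 reduces to comparing the end of the written range with the current file length. Here is a minimal user-space sketch, assuming 4 KB pages; struct inode_model and commit_size( ) are hypothetical stand-ins for the inode object and for the tail of generic_commit_write( ).

```c
#include <assert.h>
#include <stdbool.h>

#define PAGE_SHIFT_ASSUMED 12   /* assumption: 4 KB pages */

/* Hypothetical stand-in for the fields of the inode used here. */
struct inode_model {
    unsigned long long i_size;  /* current file length in bytes */
    bool dirty;                 /* set when the inode must be written */
};

/* After committing bytes up to offset `to` of page `page_index`,
 * enlarge the file if the write ended beyond its current length. */
void commit_size(struct inode_model *inode, unsigned long page_index,
                 unsigned to)
{
    unsigned long long end;

    assert(to <= (1u << PAGE_SHIFT_ASSUMED));
    end = ((unsigned long long)page_index << PAGE_SHIFT_ASSUMED) + to;
    if (end > inode->i_size) {
        inode->i_size = end;
        inode->dirty = true;
    }
}
```

A write that ends inside the existing file leaves i_size alone, which is why overwriting data in place does not dirty the inode on account of its length.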

15.1.3.2 The prepare_write and commit_write methods for block device files

Write operations into block device files are very similar to the corresponding operations on regular files. In fact, the prepare_write method of the address_space object of block device files is usually implemented by the following function:

    int blkdev_prepare_write(struct file *file, struct page *page,
                             unsigned from, unsigned to)
    {
        return block_prepare_write(page, from, to, blkdev_get_block);
    }

As you see, the function is simply a wrapper to the block_prepare_write( ) function already discussed in the previous section. The only difference, of course, is in the last parameter, which points to the function that must translate the file block number relative to the beginning of the file to a logical block number relative to the beginning of the block device. Remember that for block device files, the two numbers coincide. (See the earlier section Section 15.1.1.2 for a discussion of the blkdev_get_block( ) function.)

The commit_write method for block device files is implemented by the following simple wrapper function:

    int blkdev_commit_write(struct file *file, struct page *page,
                            unsigned from, unsigned to)
    {
        return block_commit_write(page, from, to);
    }

As you see, the commit_write method for block device files does essentially the same things as the commit_write method for regular files (we described the block_commit_write( ) function in the previous section). The only difference is that the method does not check whether the write operation has enlarged the file; you simply cannot enlarge a block device file by appending characters to its last position.


15.2 Memory Mapping

As already mentioned in Section 8.3, a memory region can be associated with some portion of either a regular file in a disk-based filesystem or a block device file. This means that an access to a byte within a page of the memory region is translated by the kernel into an operation on the corresponding byte of the file. This technique is called memory mapping.

Two kinds of memory mapping exist:

Shared
    Any write operation on the pages of the memory region changes the file on disk; moreover, if a process writes into a page of a shared memory mapping, the changes are visible to all other processes that map the same file.

Private
    Meant to be used when the process creates the mapping just to read the file, not to write it. For this purpose, private mapping is more efficient than shared mapping. But any write operation on a privately mapped page will cause it to stop mapping the page in the file. Thus, a write does not change the file on disk, nor is the change visible to any other processes that access the same file.

A process can create a new memory mapping by issuing an mmap( ) system call (see Section 15.2.2 later in this chapter). Programmers must specify either the MAP_SHARED flag or the MAP_PRIVATE flag as a parameter of the system call; as you can easily guess, in the former case the mapping is shared, while in the latter it is private. Once the mapping is created, the process can read the data stored in the file by simply reading from the memory locations of the new memory region. If the memory mapping is shared, the process can also modify the corresponding file by simply writing into the same memory locations. To destroy or shrink a memory mapping, the process may use the munmap( ) system call (see the later section Section 15.2.3).

As a general rule, if a memory mapping is shared, the corresponding memory region has the VM_SHARED flag set; if it is private, the VM_SHARED flag is cleared. As we'll see later, an exception to this rule exists for read-only shared memory mappings.
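The difference between the two kinds of mapping can be observed from User Mode with a short experiment. The function below is an illustrative sketch built on standard POSIX calls (mkstemp( ), mmap( ), msync( ), pread( ), munmap( )); the name mapping_write_reaches_file( ) and the temporary file path are hypothetical, and error handling is kept to a minimum.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Writes one byte through a mapping created with `flags` and reports
 * whether the file content changed: 1 if it did, 0 if not, -1 on error. */
int mapping_write_reaches_file(int flags)
{
    char path[] = "/tmp/mmap_demo_XXXXXX";
    char byte = 0;
    int result = -1;
    char *map;
    int fd;

    assert(flags == MAP_SHARED || flags == MAP_PRIVATE);

    fd = mkstemp(path);         /* temporary file opened read/write */
    if (fd < 0)
        return -1;
    unlink(path);               /* removed for good once fd is closed */
    if (write(fd, "a", 1) != 1)
        goto out;

    map = mmap(NULL, 1, PROT_READ | PROT_WRITE, flags, fd, 0);
    if (map == MAP_FAILED)
        goto out;

    map[0] = 'b';               /* store through the memory region */
    if (msync(map, 1, MS_SYNC) == 0 && pread(fd, &byte, 1, 0) == 1)
        result = (byte == 'b'); /* did the store reach the file? */
    munmap(map, 1);
out:
    close(fd);
    return result;
}
```

Called with MAP_SHARED, the function should report that the file now contains the new byte; called with MAP_PRIVATE, the store affects only the process's private copy of the page and the file still holds its original content.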

15.2.1 Memory Mapping Data Structures

A memory mapping is represented by a combination of the following data structures:

● The inode object associated with the mapped file
● The address_space object of the mapped file
● A file object for each different mapping performed on the file by different processes
● A vm_area_struct descriptor for each different mapping on the file
● A page descriptor for each page frame assigned to a memory region that maps the file

Figure 15-4 illustrates how the data structures are linked. In the upper-left corner, we show the inode, which identifies the file. The i_mapping field of each inode object points to the address_space object of the file. In turn, the i_mmap or i_mmap_shared fields of each address_space object point to the first element of a doubly linked list that includes all memory regions that currently map the file; if both fields are NULL, the file is not mapped by any memory region. The list contains vm_area_struct descriptors that represent memory regions, and is implemented by means of the vm_next_share and vm_pprev_share fields.

Figure 15-4. Data structures for file memory mapping

The vm_file field of each memory region descriptor contains the address of a file object for the mapped file; if that field is null, the memory region is not used in a memory mapping. The file object contains fields that allow the kernel to identify both the process that owns the memory mapping and the file being mapped. The position of the first mapped location is stored into the vm_pgoff field of the memory region descriptor; it represents the file offset as a number of page-size units. The length of the mapped file portion is simply the length of the memory region, which can be computed from the vm_start and vm_end fields.
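The arithmetic on vm_start, vm_end, and vm_pgoff can be sketched as follows. This is an illustrative model, not kernel code: struct region_model is a hypothetical stand-in for the three vm_area_struct fields, the helper names are invented, and the 4 KB page size is an assumption that holds on the 80x86.

```c
#include <assert.h>

#define PAGE_SHIFT 12   /* assumption: 4 KB pages, as on the 80x86 */

/* Simplified stand-in for the three vm_area_struct fields named above. */
struct region_model {
    unsigned long vm_start;   /* first linear address of the region */
    unsigned long vm_end;     /* first linear address after the region */
    unsigned long vm_pgoff;   /* file offset of vm_start, in page units */
};

/* Length in bytes of the mapped file portion. */
unsigned long mapped_length(const struct region_model *r)
{
    return r->vm_end - r->vm_start;
}

/* Page index, inside the file, of the page containing `address`. */
unsigned long file_page_index(const struct region_model *r,
                              unsigned long address)
{
    /* The address must fall inside the memory region. */
    assert(address >= r->vm_start && address < r->vm_end);
    return ((address - r->vm_start) >> PAGE_SHIFT) + r->vm_pgoff;
}
```

For a three-page region whose vm_pgoff is 5, the first address of the region corresponds to page 5 of the file and the last one to page 7.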

Pages of shared memory mappings are always included in the page cache; pages of private memory mappings are included in the page cache as long as they are unmodified. When a process tries to modify a page of a private memory mapping, the kernel duplicates the page frame and replaces the original page frame with the duplicate in the process Page Table; this is one of the applications of the Copy On Write mechanism that we discussed in Chapter 8. The original page frame still remains in the page cache, although it no longer belongs to the memory mapping since it is replaced by the duplicate. In turn, the duplicate is not inserted into the page cache since it no longer contains valid data representing the file on disk.

Figure 15-4 also shows a few page descriptors of pages included in the page cache that refer to the memory-mapped file. Notice that the first memory region in the figure is three pages long, but only two page frames are allocated for it; presumably, the process owning the memory region has never accessed the third page. Although not shown in the figure, the page descriptors are inserted into the clean_pages, dirty_pages, and locked_pages doubly linked lists described in Section 14.1.2.

The kernel offers several hooks to customize the memory mapping mechanism for every different filesystem. The core of memory mapping implementation is delegated to a file object's method named mmap. For most disk-based filesystems and for block device files, this method is implemented by a general function called generic_file_mmap( ), which is described in the next section.

File memory mapping depends on the demand paging mechanism described in Section 8.4.3. In fact, a newly established memory mapping is a memory region that doesn't include any page; as the process references an address inside the region, a Page Fault occurs and the Page Fault handler checks whether the nopage method of the memory region is defined. If nopage is not defined, the memory region doesn't map a file on disk; otherwise, it does, and the method takes care of reading the page by accessing the block device. Almost all disk-based filesystems and block device files implement the nopage method by means of the filemap_nopage( ) function.

15.2.2 Creating a Memory Mapping

To create a new memory mapping, a process issues an mmap( ) system call, passing the following parameters to it:

● A file descriptor identifying the file to be mapped.
● An offset inside the file specifying the first character of the file portion to be mapped.
● The length of the file portion to be mapped.
● A set of flags. The process must explicitly set either the MAP_SHARED flag or the MAP_PRIVATE flag to specify the kind of memory mapping requested.[4]
● A set of permissions specifying one or more types of access to the memory region: read access (PROT_READ), write access (PROT_WRITE), or execution access (PROT_EXEC).
● An optional linear address, which is taken by the kernel as a hint of where the new memory region should start. If the MAP_FIXED flag is specified and the kernel cannot allocate the new memory region starting from the specified linear address, the system call fails.

[4] The process could also set the MAP_ANONYMOUS flag to specify that the new memory region is anonymous, that is, not associated with any disk-based file (see Section 8.4.3). This flag is supported by some Unix operating systems, including Linux, but it is not defined by the POSIX standard. In Linux 2.4, a process can also create a memory region that is both MAP_SHARED and MAP_ANONYMOUS. In this case, the region maps a special file in the shm filesystem (see Section 19.3.5), which can be accessed by all the process's descendants.

The mmap( ) system call returns the linear address of the first location in the new memory region. For compatibility reasons, in the 80x86 architecture, the kernel reserves two entries in the system call table for mmap( ): one at index 90 and the other at index 192. The former entry corresponds to the old_mmap( ) service routine (used by older C libraries), while the latter one corresponds to the sys_mmap2( ) service routine (used by recent C libraries). The two service routines differ only in how the six parameters of the system call are passed. Both of them end up invoking the do_mmap_pgoff( ) function described in Section 8.3.4. We now complete that description by detailing the steps performed only when creating a memory region that maps a file.

1. Checks whether the mmap file operation for the file to be mapped is defined; if not, it returns an error code. A NULL value for mmap in the file operation table indicates that the corresponding file cannot be mapped (for instance, because it is a directory).

2. Checks whether the get_unmapped_area method of the file object is defined. If so, invokes it; otherwise, invokes the arch_get_unmapped_area( ) function already described in Chapter 8. On the 80x86 architecture, a custom method is used only by the frame buffer layer, so we don't discuss the case further. Remember that the arch_get_unmapped_area( ) function allocates an interval of linear addresses for the new memory region.

3. In addition to the usual consistency checks, compares the kind of memory mapping requested and the flags specified when the file was opened. The flags passed as a parameter of the system call specify the kind of mapping required, while the value of the f_mode field of the file object specifies how the file was opened. Depending on these two sources of information, it performs the following checks:

   a. If a shared writable memory mapping is required, checks that the file was opened for writing and that it was not opened in append mode (O_APPEND flag of the open( ) system call)

   b. If a shared memory mapping is required, checks that there is no mandatory lock on the file (see Section 12.7)

   c. For any kind of memory mapping, checks that the file was opened for reading

   If any of these conditions is not fulfilled, an error code is returned.

4. When initializing the value of the vm_flags field of the new memory region descriptor, sets the VM_READ, VM_WRITE, VM_EXEC, VM_SHARED, VM_MAYREAD, VM_MAYWRITE, VM_MAYEXEC, and VM_MAYSHARE flags according to the access rights of the file and the kind of requested memory mapping (see Section 8.3.2). As an optimization, the VM_SHARED flag is cleared for nonwritable shared memory mappings. This can be done because the process is not allowed to write into the pages of the memory region, so the mapping is treated the same as a private mapping; however, the kernel actually allows other processes that share the file to access the pages in this memory region.

5. Initializes the vm_file field of the memory region descriptor with the address of the file object and increments the file's usage counter.

6. Invokes the mmap method for the file being mapped, passing as parameters the address of the file object and the address of the memory region descriptor. For most filesystems, this method is implemented by the generic_file_mmap( ) function, which performs the following operations:

   a. If a shared writable memory mapping is required, checks that the writepage method of the address_space object of the file is defined; if not, it returns the error code -EINVAL.

   b. Checks that the readpage method of the address_space object of the file is defined; if not, it returns the error code -ENOEXEC.

   c. Stores the current time in the i_atime field of the file's inode and marks the inode as dirty.

   d. Initializes the vm_ops field of the memory region descriptor with the address of the generic_file_vm_ops table. All methods in this table are null, except the nopage method, which is implemented by the filemap_nopage( ) function.

7. Recall from Section 8.3.4 that do_mmap( ) invokes vma_link( ). This function inserts the memory region descriptor into either the i_mmap list or the i_mmap_shared list of the address_space object, according to whether the requested memory mapping is private or shared, respectively.
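The consistency checks of step 3 can be observed from User Mode through the error returned by mmap( ). The sketch below assumes a Linux system with a writable /tmp directory; try_file_mapping( ) is a hypothetical helper, and the EACCES value it is expected to report for a refused request is the one documented for the mmap( ) system call, not something stated in the steps above.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: reopens a one-page temporary file with `open_flags`
 * and requests a mapping of it with protection `prot` and kind `kind`.
 * Returns 0 if mmap( ) succeeds, the errno value if it fails, and -1 if
 * the temporary file cannot be set up.
 */
int try_file_mapping(int open_flags, int prot, int kind)
{
    char path[] = "/tmp/mmap_check_XXXXXX";
    long page = sysconf(_SC_PAGESIZE);
    int tmp = mkstemp(path);
    int fd, result;
    void *addr;

    if (tmp < 0)
        return -1;
    if (ftruncate(tmp, page) != 0) {
        close(tmp);
        unlink(path);
        return -1;
    }
    close(tmp);

    /* Reopen the file so that f_mode reflects the requested access mode. */
    fd = open(path, open_flags);
    unlink(path);
    if (fd < 0)
        return -1;

    addr = mmap(NULL, page, prot, kind, fd, 0);
    if (addr == MAP_FAILED) {
        result = errno;             /* the request was refused */
    } else {
        result = 0;
        munmap(addr, page);
    }
    close(fd);
    return result;
}
```

A shared writable mapping of a file opened with O_RDONLY is refused (check a), and so is any mapping of a file opened with O_WRONLY (check c); a private writable mapping of a read-only file is accepted, since writes never reach the file.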

15.2.3 Destroying a Memory Mapping

When a process is ready to destroy a memory mapping, it invokes the munmap( ) system call, passing the following parameters to it:

● The address of the first location in the linear address interval to be removed
● The length of the linear address interval to be removed

Notice that the munmap( ) system call can be used to either remove or reduce the size of each kind of memory region. Indeed, the sys_munmap( ) service routine of the system call essentially invokes the do_munmap( ) function already described in Section 8.3.5. However, if the memory region maps a file, the following additional steps are performed for each memory region included in the range of linear addresses to be released:

1. Invokes remove_shared_vm_struct( ) to remove the memory region descriptor from the address_space object list (either i_mmap or i_mmap_shared).

2. When executing the unmap_fixup( ) function, decrements the file usage counter if an entire memory region is destroyed, and increments the file usage counter if a new memory region is created (that is, if the unmapping created a hole inside a region). If the region has just been shrunk, it leaves the file usage counter unchanged.

Notice that there is no need to flush to disk the contents of the pages included in a writable shared memory mapping to be destroyed. In fact, these pages continue to act as a disk cache because they are still included in the page cache (see the next section).
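The rule followed for the file usage counter in step 2 can be modeled with a small function. This is a hypothetical sketch, not the unmap_fixup( ) code: file_count_delta( ) is an invented name, intervals are half-open, and the unmapped interval is assumed to overlap the region.

```c
/*
 * Hypothetical model of the usage counter rule: given a memory region
 * [vm_start, vm_end) and an overlapping interval [start, end) to be
 * unmapped, returns the change applied to the file usage counter.
 */
int file_count_delta(unsigned long vm_start, unsigned long vm_end,
                     unsigned long start, unsigned long end)
{
    if (start <= vm_start && end >= vm_end)
        return -1;      /* entire region destroyed: one reference dropped */
    if (start > vm_start && end < vm_end)
        return +1;      /* hole: the region is split, so a second
                           descriptor now references the file */
    return 0;           /* region shrunk at its head or tail */
}
```

Unmapping the middle page of a three-page region, for example, yields +1 because two memory regions now refer to the same file object.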

15.2.4 Demand Paging for Memory Mapping

For reasons of efficiency, page frames are not assigned to a memory mapping right after it has been created, but at the last possible moment, that is, when the process attempts to address one of its pages, thus causing a Page Fault exception.

We saw in Section 8.4 how the kernel verifies whether the faulty address is included in some memory region of the process; if so, the kernel checks the Page Table entry corresponding to the faulty address and invokes the do_no_page( ) function if the entry is null (see Section 8.4.3).

The do_no_page( ) function performs all the operations that are common to all types of demand paging, such as allocating a page frame and updating the Page Tables. It also checks whether the nopage method of the memory region involved is defined. In Section 8.4.3, we described the case in which the method is undefined (anonymous memory region); now we complete the description by discussing the actions performed by the function when the method is defined:

1. Invokes the nopage method, which returns the address of a page frame that contains the requested page.

2. If the process is trying to write into the page and the memory mapping is private, avoids a future Copy On Write fault by making a copy of the page just read and inserting it into the inactive list of pages (see Chapter 16). In the following steps, the function uses the new page instead of the page returned by the nopage method so that the latter is not modified by the User Mode process.

3. Increments the rss field of the process memory descriptor to indicate that a new page frame has been assigned to the process.

4. Sets up the Page Table entry corresponding to the faulty address with the address of the page frame and the page access rights included in the memory region vm_page_prot field.

5. If the process is trying to write into the page, forces the Read/Write and Dirty bits of the Page Table entry to 1. In this case, either the page frame is exclusively assigned to the process, or the page is shared; in both cases, writing to it should be allowed.

The core of the demand paging algorithm consists of the memory region's nopage method. Generally speaking, it must return the address of a page frame that contains the page accessed by the process. Its implementation depends on the kind of memory region in which the page is included.

When handling memory regions that map files on disk, the nopage method must first search for the requested page in the page cache. If the page is not found, the method must read it from disk. Most filesystems implement the nopage method by means of the filemap_nopage( ) function, which receives three parameters:

area
Descriptor address of the memory region, including the required page.

address
Linear address of the required page.

unused
Parameter of the nopage method that is not used by filemap_nopage( ).

The filemap_nopage( ) function executes the following steps:

1. Gets the file object address file from the area->vm_file field. Derives the address_space object address from file->f_dentry->d_inode->i_mapping. Derives the inode object address from the host field of the address_space object.

2. Uses the vm_start and vm_pgoff fields of area to determine the offset within the file of the data corresponding to the page starting from address.

3. Checks whether the file offset exceeds the file size. When this happens, returns NULL, which means failure in allocating the new page, unless the Page Fault was caused by a debugger tracing another process through the ptrace( ) system call. We are not going to discuss this special case.

4. Invokes find_get_page( ) to look in the page cache for the page identified by the address_space object and the file offset.

5. If the page is not in the page cache, checks the value of the VM_RAND_READ flag of the memory region. The value of this flag can be changed by means of the madvise( ) system call; when the flag is set, it indicates that the user application is not going to read more pages of the file than those just accessed.

   ● If the VM_RAND_READ flag is set, invokes page_cache_read( ) to read just the requested page from disk (see the earlier section Section 15.1.1).

   ● If the VM_RAND_READ flag is cleared, invokes page_cache_read( ) several times to read a cluster of adjacent pages inside the memory region, including the requested page. The length of the cluster is stored in the page_request variable; its default value is three pages, but the system administrator may tune its value by writing into the /proc/sys/vm/page-cluster special file.

   Then the function jumps back to Step 4 and repeats the page cache lookup operation (the process might have been blocked while executing the page_cache_read( ) function).

6. The page is inside the page cache. Checks its PG_uptodate flag. If the flag is not set (page not up to date), the function performs the following substeps:

   a. Locks up the page by setting the PG_locked flag, sleeping if necessary.

   b. Invokes the readpage method of the address_space object to trigger the I/O data transfer.

   c. Invokes wait_on_page( ) to sleep until the I/O transfer completes.

7. The page is up to date. The function checks the VM_SEQ_READ flag of the memory region. The value of this flag can be changed by means of the madvise( ) system call; when the flag is set, it indicates that the user application is going to reference the pages of the mapped file sequentially, thus the pages should be aggressively read in advance and freed after they are accessed. If the flag is set, it invokes nopage_sequential_readahead( ). This function uses a large, fixed-size read-ahead window, whose length is approximately the maximum read-ahead window size of the underlying block device (see the earlier section Section 15.1.2). The vm_raend field of the memory region descriptor stores the ending position of the current read-ahead window. The function shifts the read-ahead window forward (by reading in advance the corresponding pages) whenever the requested page falls exactly in the middle point of the current read-ahead window. Moreover, the function should release the pages in the memory region that are far behind the requested page; if the function reads the nth read-ahead window of the memory region, it flushes to disk the pages belonging to the (n-3)th window (however, the kernel Version 2.4.18 doesn't release them; see the next section).

8. Invokes mark_page_accessed( ) to mark the requested page as accessed (see Chapter 16).

9. Returns the address of the requested page.
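From User Mode, the read-ahead policy chosen in steps 5 and 7 is selected through madvise( ). The following sketch assumes a Linux system with a writable /tmp directory, and assumes that the MADV_RANDOM and MADV_SEQUENTIAL advice values are the ones that drive the VM_RAND_READ and VM_SEQ_READ flags; touch_pages_with_advice( ) is a hypothetical helper.

```c
#define _GNU_SOURCE
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: creates a temporary file of `npages` pages, each
 * filled with the byte value 1, maps it privately, passes `advice` (for
 * example MADV_SEQUENTIAL or MADV_RANDOM) to madvise( ), and then reads
 * the first byte of every page. Touching a page that is not yet mapped
 * causes a Page Fault served by the nopage method of the memory region.
 * Returns the sum of the bytes read (equal to npages), or -1 on error.
 */
long touch_pages_with_advice(long npages, int advice)
{
    char path[] = "/tmp/mmap_advice_XXXXXX";
    long page = sysconf(_SC_PAGESIZE);
    long i, touched = -1;
    const volatile char *map;
    char *buf;
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);

    /* Fill the file one page at a time. */
    buf = malloc(page);
    if (buf == NULL) {
        close(fd);
        return -1;
    }
    memset(buf, 1, page);
    for (i = 0; i < npages; i++) {
        if (write(fd, buf, page) != page) {
            free(buf);
            close(fd);
            return -1;
        }
    }
    free(buf);

    map = mmap(NULL, npages * page, PROT_READ, MAP_PRIVATE, fd, 0);
    if (map != MAP_FAILED) {
        if (madvise((void *)map, npages * page, advice) == 0) {
            touched = 0;
            for (i = 0; i < npages; i++)
                touched += map[i * page];   /* may fault the page in */
        }
        munmap((void *)map, npages * page);
    }
    close(fd);
    return touched;
}
```

The advice changes only how many pages the kernel reads in around each fault; the data seen by the process is the same in both cases.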

15.2.5 Flushing Dirty Memory Mapping Pages to Disk

The msync( ) system call can be used by a process to flush to disk dirty pages belonging to a shared memory mapping. It receives as parameters the starting address of an interval of linear addresses, the length of the interval, and a set of flags that have the following meanings:

MS_SYNC
Asks the system call to suspend the process until the I/O operation completes. In this way, the calling process can assume that when the system call terminates, all pages of its memory mapping have been flushed to disk.

MS_ASYNC
Asks the system call to return immediately without suspending the calling process.

MS_INVALIDATE
Asks the system call to remove all pages included in the memory mapping from the process address space (not really implemented).

The sys_msync( ) service routine invokes msync_interval( ) on each memory region included in the interval of linear addresses. In turn, the latter function performs the following operations:

1. If the vm_file field of the memory region descriptor is NULL, or if the VM_SHARED flag is clear, returns 0 (the memory region is not a writable shared memory mapping of a file).

2. Invokes the filemap_sync( ) function, which scans the Page Table entries corresponding to the linear address intervals included in the memory region. For each page found, it invokes flush_tlb_page( ) to flush the corresponding translation lookaside buffers, and marks the page as dirty.

3. If the MS_SYNC flag is not set, returns. Otherwise, continues with the following steps to flush the pages in the memory region to disk, sleeping until all I/O data transfers terminate. Notice that, at least in the last stable version of the kernel at the time of this writing, the function does not take the MS_INVALIDATE flag into consideration.

4. Acquires the i_sem semaphore of the file's inode.

5. Invokes the filemap_fdatasync( ) function, which receives the address of the file's address_space object. For every page belonging to the dirty pages list of the address_space object, the function performs the following substeps:

   a. Moves the page from the dirty pages list to the locked pages list.

   b. If the PG_dirty flag is not set, continues with the next page in the list (the page is already being flushed by another process).

   c. Increments the usage counter of the page and locks it, sleeping if necessary.

   d. Clears the PG_dirty flag of the page.

   e. Invokes the writepage method of the address_space object on the page (described following this list).

   f. Releases the usage counter of the page.

   The writepage method for block device files and almost all disk-based filesystems is just a wrapper for the block_write_full_page( ) function; it is used to pass to block_write_full_page( ) the address of a filesystem-dependent function that translates the block numbers relative to the beginning of the file into logical block numbers relative to positions of the block in the disk partition. (This is the same mechanism that is already described in the earlier section Section 15.1.1 and that is used for the readpage method.) In turn, block_write_full_page( ) is very similar to block_read_full_page( ) described earlier: it allocates asynchronous buffer heads for the page, and invokes the submit_bh( ) function on each of them specifying the WRITE operation.

6. Checks whether the fsync method of the file object is defined; if so, executes it. For regular files, this method usually limits itself to flushing the inode object of the file to disk. For block device files, however, the method invokes sync_buffers( ), which activates the I/O data transfer of all dirty buffers of the device.

7. Executes the filemap_fdatawait( ) function. For each page in the locked pages list of the address_space object, the function waits until the page becomes unlocked, that is, until the ongoing I/O data transfer on the page terminates.

8. Releases the i_sem semaphore of the file.
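The User Mode side of this procedure is a single msync( ) call on a shared writable mapping. The following is a minimal sketch under the assumption of a Linux system with a writable /tmp directory; write_and_sync( ) is a hypothetical helper that only illustrates the MS_SYNC case.

```c
#define _GNU_SOURCE
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: writes `value` through a shared writable mapping
 * of a one-page temporary file, flushes the mapping with msync(MS_SYNC),
 * and returns the byte that pread( ) then finds at the start of the file.
 * Returns -1 on error.
 */
int write_and_sync(char value)
{
    char path[] = "/tmp/mmap_msync_XXXXXX";
    long page = sysconf(_SC_PAGESIZE);
    int fd = mkstemp(path);
    int result = -1;
    char *map, seen;

    if (fd < 0)
        return -1;
    unlink(path);
    if (ftruncate(fd, page) != 0) {
        close(fd);
        return -1;
    }

    map = mmap(NULL, page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map != MAP_FAILED) {
        map[0] = value;                         /* dirties the page */
        if (msync(map, page, MS_SYNC) == 0 &&   /* returns after the flush */
            pread(fd, &seen, 1, 0) == 1)
            result = (unsigned char)seen;
        munmap(map, page);
    }
    close(fd);
    return result;
}
```

Passing MS_ASYNC instead would let the call return immediately, which corresponds to the early return in step 3.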


15.3 Direct I/O Transfers

As we have seen, in Version 2.4 of Linux, there is no substantial difference between accessing a regular file through the filesystem, accessing it by referencing its blocks on the underlying block device file, or even establishing a file memory mapping. There are, however, some highly sophisticated programs (self-caching applications) that would like to have full control of the whole I/O data transfer mechanism. Consider, for example, high-performance database servers: most of them implement their own caching mechanisms that exploit the peculiar nature of the queries to the database. For these kinds of programs, the kernel page cache doesn't help; on the contrary, it is detrimental for the following reasons:

● Lots of page frames are wasted to duplicate disk data already in RAM (in the user-level disk cache)
● The read( ) and write( ) system calls are slowed down by the redundant instructions that handle the page cache and the read-ahead; ditto for the paging operations related to the file memory mappings
● Rather than transferring the data directly between the disk and the user memory, the read( ) and write( ) system calls make two transfers: between the disk and a kernel buffer and between the kernel buffer and the user memory

Since block hardware devices must be handled through interrupts and Direct Memory Access (DMA), and this can be done only in Kernel Mode, some sort of kernel support is definitely required to implement self-caching applications.

Version 2.4 of Linux offers a simple way to bypass the page cache: direct I/O transfers. In each direct I/O transfer, the kernel programs the disk controller to transfer the data directly from/to pages belonging to the User Mode address space of a self-caching application.

As we know, any data transfer proceeds asynchronously. While it is in progress, the kernel may switch the current process, the CPU may return to User Mode, the pages of the process that raised the data transfer might be swapped out, and so on. This works just fine for ordinary I/O data transfers because they involve pages of the disk caches. Disk caches are owned by the kernel, cannot be swapped out, and are visible to all processes in Kernel Mode.

On the other hand, direct I/O transfers should move data within pages that belong to the User Mode address space of a given process. The kernel must take care that these pages are accessible by any process in Kernel Mode and that they are not swapped out while the data transfer is in progress. This is achieved thanks to the "direct access buffers."

A direct access buffer consists of a set of physical page frames reserved for direct I/O data transfers, which are mapped both by the User Mode Page Tables of a self-caching application and by the kernel Page Tables (the Kernel Mode Page Tables of each process). Each direct access buffer is described by a kiobuf data structure, whose fields are shown in Table 15-2.

Table 15-2. The fields of the direct access buffer descriptor

Type                         Field        Description
int                          nr_pages     Number of pages in the direct access buffer
int                          array_len    Number of free elements in the map_array field
int                          offset       Offset to valid data inside the first page of the direct access buffer
int                          length       Length of valid data inside the direct access buffer
struct page **               maplist      List of page descriptor pointers referring to pages in the direct access buffer (usually points to the map_array field)
unsigned int                 locked       Lock flag for all pages in the direct access buffer
struct page * [ ]            map_array    Array of 129 page descriptor pointers
struct buffer_head * [ ]     bh           Array of 1,024 preallocated buffer head pointers
unsigned long [ ]            blocks       Array of 1,024 logical block numbers
atomic_t                     io_count     Atomic flag that indicates whether I/O is in progress
int                          errno        Error number of last I/O operation
void (*) (struct kiobuf *)   end_io       Completion method
wait_queue_head_t            wait_queue   Queue of processes waiting for I/O to complete
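To make the layout of Table 15-2 concrete, the descriptor can be sketched as a C structure. This is only an illustration meant to compile in User Mode: the opaque stand-in types, the two size macros, the kiobuf_init_sketch( ) helper, and the renaming of the errno field are our assumptions, not the kernel's declarations.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in types so the sketch compiles outside the kernel (assumptions). */
struct page;                                  /* opaque page descriptor */
struct buffer_head;                           /* opaque buffer head */
typedef struct { volatile int counter; } atomic_t;
typedef struct { int unused; } wait_queue_head_t;

#define SKETCH_STATIC_PAGES 129               /* 512 KB / 4 KB pages, plus one */
#define SKETCH_MAX_BLOCKS   1024              /* 512 KB / 512-byte blocks */

struct kiobuf {
    int nr_pages;                             /* pages in the buffer */
    int array_len;                            /* free elements in map_array */
    int offset;                               /* offset of valid data in first page */
    int length;                               /* length of valid data */
    struct page **maplist;                    /* usually points to map_array */
    unsigned int locked;                      /* lock flag for all pages */
    struct page *map_array[SKETCH_STATIC_PAGES];
    struct buffer_head *bh[SKETCH_MAX_BLOCKS];
    unsigned long blocks[SKETCH_MAX_BLOCKS];  /* logical block numbers */
    atomic_t io_count;                        /* I/O in progress? */
    int io_errno;                             /* "errno" in Table 15-2; renamed
                                                 because errno is a macro in
                                                 User Mode C */
    void (*end_io)(struct kiobuf *);          /* completion method */
    wait_queue_head_t wait_queue;             /* processes waiting for I/O */
};

/* Hypothetical helper: puts a descriptor in the "no page frames yet" state
 * that the text describes for a freshly opened O_DIRECT file. */
void kiobuf_init_sketch(struct kiobuf *iobuf)
{
    assert(iobuf != NULL);
    iobuf->nr_pages = 0;
    iobuf->array_len = SKETCH_STATIC_PAGES;
    iobuf->offset = 0;
    iobuf->length = 0;
    iobuf->maplist = iobuf->map_array;
    iobuf->locked = 0;
    iobuf->io_errno = 0;
    iobuf->end_io = NULL;
}
```

Embedding the three arrays in the descriptor is what lets a transfer start without any further memory allocation, at the price of a rather large structure.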

Suppose a self-caching application wishes to directly access a file. As a first step, the application opens the file specifying the O_DIRECT flag (see Section 12.6.1). While servicing the open( ) system call, the dentry_open( ) function checks the value of this flag; if it is set, the function invokes alloc_kiovec( ), which allocates a new direct access buffer descriptor and stores its address into the f_iobuf field of the file object. Initially the buffer includes no page frames, so the nr_pages field of the descriptor stores the value 0. The alloc_kiovec( ) function, however, preallocates 1,024 buffer heads, whose addresses are stored in the bh array of the descriptor. These buffer heads ensure that the self-caching application is not blocked while directly accessing the file (recall that ordinary data transfers block if no free buffer heads are available). A drawback of this approach, however, is that data transfers must be done in chunks of at most 512 KB.

Next, suppose the self-caching application issues a read( ) or write( ) system call on the file opened with O_DIRECT. As mentioned earlier in this chapter, the generic_file_read( ) and generic_file_write( ) functions check the value of the flag and handle the case in a special way. For instance, the generic_file_read( ) function executes a code fragment essentially equivalent to the following:

    if (filp->f_flags & O_DIRECT) {
        inode = filp->f_dentry->d_inode->i_mapping->host;
        if (count == 0 || *ppos >= inode->i_size)
            return 0;
        if (*ppos + count > inode->i_size)
            count = inode->i_size - *ppos;
        retval = generic_file_direct_IO(READ, filp, buf, count, *ppos);
        if (retval > 0)
            *ppos += retval;
        UPDATE_ATIME(filp->f_dentry->d_inode);
        return retval;
    }

The function checks the current values of the file pointer, the file size, and the number of requested characters, and then invokes the generic_file_direct_IO( ) function, passing to it the READ operation type, the file object pointer, the address of the User Mode buffer, the number of requested bytes, and the file pointer. The generic_file_write( ) function is similar, but of course it passes the WRITE operation type to the generic_file_direct_IO( ) function.

The generic_file_direct_IO( ) function performs the following steps:

1. Tests and sets the f_iobuf_lock lock in the file object. If it was already set, the direct access buffer descriptor stored in f_iobuf is already in use by a concurrent direct I/O transfer, so the function allocates a new direct access buffer descriptor and uses it in the following steps.

2. Checks that the file pointer offset and the number of requested characters are multiples of the block size of the file; returns -EINVAL if they are not.

3. Checks that the direct_IO method of the address_space object of the file (filp->f_dentry->d_inode->i_mapping) is defined; returns -EINVAL if it isn't.

4. Even if the self-caching application is accessing the file directly, there could be other applications in the system that access the file through the page cache. To avoid data loss, the disk image is synchronized with the page cache before starting the direct I/O transfer. The function flushes the dirty pages belonging to memory mappings of the file to disk by invoking the filemap_fdatasync( ) function (see the previous section).

5. Flushes to disk the dirty pages updated by write( ) system calls by invoking the fsync_inode_data_buffers( ) function, and waits until the I/O transfer terminates.

6. Invokes the filemap_fdatawait( ) function to wait until the I/O operations started in Step 4 complete (see the previous section).

7. Starts a loop, and divides the data to be transferred in chunks of 512 KB. For every chunk, the function performs the following substeps:

   a. Invokes map_user_kiobuf( ) to establish a mapping between the direct access buffer and the portion of the user-level buffer corresponding to the chunk. To achieve this, the function:

      1. Invokes expand_kiobuf( ) to allocate a new array of page descriptor addresses in case the array embedded in the direct access buffer descriptor is too small. This is not the case here, however, because the 129 entries in the map_array field suffice to map the chunk of 512 KB (notice that the additional page is required when the buffer is not page-aligned).

      2. Accesses all user pages in the chunk (allocating them when necessary by simulating Page Faults) and stores their addresses in the array pointed to by the maplist field of the direct access buffer descriptor.

      3. Properly initializes the nr_pages, offset, and length fields, and resets the locked field to 0.

   b. Invokes the direct_IO method of the address_space object of the file (explained next).

   c. If the operation type was READ, invokes mark_dirty_kiobuf( ) to mark the pages mapped by the direct access buffer as dirty.

   d. Invokes unmap_kiobuf( ) to release the mapping between the chunk and the direct access buffer, and then continues with the next chunk.

8. If the function allocated a temporary direct access buffer descriptor in Step 1, it releases it. Otherwise, it releases the f_iobuf_lock lock in the file object.
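The arithmetic behind the size clamp performed by generic_file_read( ), the alignment check of Step 2, and the chunking of Step 7 can be sketched in plain C. The function names, the 4 KB page size, and the use of unsigned long for offsets are assumptions made for this User Mode illustration; the kernel code is organized differently.

```c
#include <assert.h>
#include <errno.h>

#define SKETCH_PAGE_SIZE   4096UL
#define SKETCH_CHUNK_SIZE  (512UL * 1024UL)   /* largest chunk in Step 7 */

/* Clamp a read request to the file size, mirroring the O_DIRECT branch of
 * generic_file_read( ). Returns the number of bytes to transfer. */
unsigned long clamp_read(unsigned long pos, unsigned long count,
                         unsigned long file_size)
{
    if (count == 0 || pos >= file_size)
        return 0;
    if (pos + count > file_size)
        count = file_size - pos;
    return count;
}

/* Step 2: offset and length must both be multiples of the block size. */
int check_alignment(unsigned long pos, unsigned long count,
                    unsigned long blocksize)
{
    assert(blocksize != 0);
    if (pos % blocksize != 0 || count % blocksize != 0)
        return -EINVAL;
    return 0;
}

/* Number of pages touched by a user buffer of len bytes starting at addr.
 * An unaligned 512 KB chunk touches 129 pages, hence the extra entry. */
unsigned long pages_spanned(unsigned long addr, unsigned long len)
{
    unsigned long first, last;

    if (len == 0)
        return 0;
    first = addr / SKETCH_PAGE_SIZE;
    last = (addr + len - 1) / SKETCH_PAGE_SIZE;
    return last - first + 1;
}

/* Step 7: number of chunks of at most 512 KB needed for count bytes. */
unsigned long count_chunks(unsigned long count)
{
    unsigned long chunks = 0;

    while (count > 0) {
        unsigned long this_chunk =
            count < SKETCH_CHUNK_SIZE ? count : SKETCH_CHUNK_SIZE;
        count -= this_chunk;
        chunks++;
    }
    return chunks;
}
```

For example, a page-aligned 512 KB chunk spans 128 pages, while the same chunk starting one byte into a page spans 129, which matches the size of the map_array field.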

In almost all cases, the direct_IO method is a wrapper for the generic_direct_IO( ) function, passing it the address of the usual filesystem-dependent function that computes the position of the physical blocks on the block device (see Section 15.1.1 earlier in this chapter). This function executes the following steps:

1. For each block of the file portion corresponding to the current chunk, invokes the filesystem-dependent function to determine its logical block number, and stores this number in an entry of the blocks array in the direct access buffer descriptor. The 1,024 entries of the array suffice because the minimum block size in Linux is 512 bytes.

2. Invokes the brw_kiovec( ) function, which essentially calls the submit_bh( ) function on each block in the blocks array using the buffer heads stored in the bh array of the direct access buffer descriptor. The direct I/O operation is similar to a buffer or page I/O operation, but the b_end_io method of the buffer heads is set to the special function end_buffer_io_kiobuf( ) rather than to end_buffer_io_sync( ) or end_buffer_io_async( ) (see Section 13.4.8). The method deals with the fields of the kiobuf data structure. brw_kiovec( ) does not return until the I/O data transfers are completed.
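Step 1 above amounts to filling the blocks array with one logical block number per file block of the chunk. The sketch below illustrates why 1,024 entries are enough; get_block_fn, demo_get_block( ), and fill_blocks( ) are hypothetical names, and the "file starts at device block 1000" mapping is invented purely for the demonstration.

```c
#include <assert.h>

#define SKETCH_CHUNK_BYTES  (512UL * 1024UL)  /* largest chunk per transfer */
#define SKETCH_NR_BLOCKS    1024UL            /* entries in the blocks array */

/* Hypothetical stand-in for the filesystem-dependent function: maps a file
 * block index to a logical block number on the device. */
typedef unsigned long (*get_block_fn)(unsigned long file_block);

/* Invented mapping used only for demonstration: the file is assumed to
 * occupy consecutive device blocks starting at block 1000. */
unsigned long demo_get_block(unsigned long file_block)
{
    return 1000UL + file_block;
}

/* Fills blocks[] for one chunk. Returns the number of entries written,
 * or 0 if the chunk would not fit in the array. */
unsigned long fill_blocks(unsigned long *blocks,
                          unsigned long first_file_block,
                          unsigned long chunk_bytes,
                          unsigned long blocksize,
                          get_block_fn get_block)
{
    unsigned long i, nr;

    assert(blocksize >= 512 && chunk_bytes % blocksize == 0);
    nr = chunk_bytes / blocksize;
    if (nr > SKETCH_NR_BLOCKS)
        return 0;
    for (i = 0; i < nr; i++)
        blocks[i] = get_block(first_file_block + i);
    return nr;
}
```

With the smallest block size of 512 bytes, a full 512 KB chunk needs exactly 1,024 entries; with 4 KB blocks it needs only 128.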


Chapter 16. Swapping: Methods for Freeing Memory

The disk caches examined in previous chapters used RAM as an extension of the disk; the goal was to improve system response time and the solution was to reduce the number of disk accesses. In this chapter, we introduce an opposite approach called swapping, in which the kernel uses some space on disk as an extension of RAM. Swapping is transparent to the programmer: once the swapping areas are properly installed and activated, the processes may run under the assumption that they have all the physical memory available that they can address, never knowing that some of their pages are stored away and retrieved again as needed.

Disk caches enhance system performance at the expense of free RAM, while swapping extends the amount of addressable memory at the expense of access speed. Thus, disk caches are "good" and desirable, while swapping should be regarded as some sort of last resort to be used whenever the amount of free RAM becomes too scarce.

We start by defining swapping in Section 16.1. Then we describe in Section 16.2 the main data structures introduced by Linux to implement swapping. We discuss the swap cache and the low-level functions that transfer pages between RAM and swap areas, and vice versa. The two crucial sections are Section 16.5, where we describe the procedure used to select a page to be swapped out to disk, and Section 16.6, where we explain how a page stored in a swap area is read back into RAM when the need occurs.

This chapter effectively concludes our discussion of memory management. Just one topic remains to be covered, namely, page frame reclaiming; this is done in the last section, which is related only in part to swapping. With so many disk caches around, including the swap cache, all the available RAM could eventually end up in these caches and no more free RAM would be left. We shall see how the kernel prevents this by monitoring the amount of free RAM and by freeing pages from the caches or from the process address spaces, as the need occurs.


16.1 What Is Swapping?

Swapping serves two main purposes:

● To expand the address space that is effectively usable by a process

● To expand the amount of dynamic RAM (i.e., what is left of the RAM once the kernel code and static data structures have been initialized) to load processes

Let's give a few examples of how swapping benefits the user. The simplest is when a program's data structures take up more space than the size of the available RAM. A swap area will allow this program to be loaded without any problem, and thus to run correctly. A more subtle example involves users who issue several commands that try to simultaneously run large applications that require a lot of memory. If no swap area is active, the system might reject requests to launch a new application. In contrast, a swap area allows the kernel to launch it, since some memory can be freed at the expense of some of the already existing processes without killing them.

These two examples illustrate the benefits, but also the drawbacks, of swapping. Simulation of RAM is not like RAM in terms of performance. Every access by a process to a page that is currently swapped out increases the process execution time by several orders of magnitude. In short, if performance is of great importance, swapping should be used only as a last resort; adding RAM chips still remains the best solution to cope with increasing computing needs. It is fair to say, however, that in some cases, swapping may be beneficial to the system as a whole. Long-running processes typically access only half of the page frames obtained. Even when some RAM is available, swapping unused pages out and using the RAM for disk cache can improve overall system performance.

Swapping has been around for many years. The first Unix system kernels monitored the amount of free memory constantly. When it became less than a fixed threshold, they performed some swapping out. This activity consisted of copying the entire address space of a process to disk. Conversely, when the scheduling algorithm selected a swapped-out process, the whole process was swapped in from disk.

This approach was abandoned by modern Unix kernels, including Linux, mainly because process switches are quite expensive when they involve swapping in swapped-out processes. To compensate for the burden of such swapping activity, the scheduling algorithm must be very sophisticated: it must favor in-RAM processes without completely shutting out the swapped-out ones.

In Linux, swapping is currently performed at the page level rather than at the process address space level. This finer level of granularity has been reached thanks to the inclusion of a hardware paging unit in the CPU. Recall from Section 2.4.1 that each Page Table entry includes a Present flag; the kernel can take advantage of this flag to signal to the hardware that a page belonging to a process address space has been swapped out. Besides that flag, Linux also takes advantage of the remaining bits of the Page Table entry to store the location of the swapped-out page on disk. When a Page Fault exception occurs, the corresponding exception handler can detect that the page is not present in RAM and invoke the function that swaps the missing page in from the disk.

Much of the algorithm's complexity is thus related to swapping out. In particular, four main issues must be considered:

● Which kind of page to swap out

● How to distribute pages in the swap areas

● How to select the page to be swapped out

● When to perform page swap out

Let's give a short preview of how Linux handles these four issues before describing the main data structures and functions related to swapping.
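Before turning to the four issues, the idea mentioned earlier of reusing a non-present Page Table entry to record where a swapped-out page lives can be sketched as follows. The bit layout (bit 0 as the Present flag, six bits for the swap area index, the remaining high bits for the page slot) is an assumption modeled on 32-bit 80x86 entries, and the function names are ours; the text above states only that the remaining bits hold the location.

```c
#include <assert.h>

#define SKETCH_PTE_PRESENT 0x1UL   /* Present flag: bit 0 of the entry */

/* A swapped-out page identifier, stored in place of a Page Table entry. */
typedef struct { unsigned long val; } swp_entry_t;

/* Assumed layout: bit 0 clear (not present), bits 1-6 = swap area index,
 * bits 8 and up = page slot index inside that swap area. */
swp_entry_t make_swp_entry(unsigned long area, unsigned long slot)
{
    swp_entry_t e;

    assert(area < 64);             /* must fit in six bits */
    e.val = (area << 1) | (slot << 8);
    return e;
}

unsigned long swp_area(swp_entry_t e) { return (e.val >> 1) & 0x3f; }
unsigned long swp_slot(swp_entry_t e) { return e.val >> 8; }

int pte_is_present(unsigned long pte)
{
    return (pte & SKETCH_PTE_PRESENT) != 0;
}

/* A nonzero entry with the Present flag clear denotes a swapped-out page;
 * an all-zero entry means that no page was ever mapped there. */
int pte_is_swapped_out(unsigned long pte)
{
    return pte != 0 && !pte_is_present(pte);
}
```

Because the Present flag is clear, the hardware raises a Page Fault on any access to such an entry, and the handler can decode the area and slot to find the page on disk.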

16.1.1 Which Kind of Page to Swap Out

Swapping applies only to the following kinds of pages:

● Pages that belong to an anonymous memory region of a process (for instance, a User Mode stack)

● Modified pages that belong to a private memory mapping of a process

● Pages that belong to an IPC shared memory region (see Section 19.3.5)

The remaining kinds of pages are either used by the kernel or used to map files on disk. In the first case, they are ignored by swapping because this simplifies the kernel design; in the second case, the best swap areas for the pages are the files themselves.

16.1.2 How to Distribute Pages in the Swap Areas

Each swap area is organized into slots, where each slot contains exactly one page. When swapping out, the kernel tries to store pages in contiguous slots to minimize disk seek time when accessing the swap area; this is an important element of an efficient swapping algorithm.

If more than one swap area is used, things become more complicated. Faster swap areas (that is, swap areas stored in faster disks) get a higher priority. When looking for a free slot, the search starts in the swap area that has the highest priority. If there are several of them, swap areas of the same priority are cyclically selected to avoid overloading one of them. If no free slot is found in the swap areas that have the highest priority, the search continues in the swap areas that have a priority next to the highest one, and so on.
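The search order just described (highest priority first, cyclic selection among equals, then fall back to lower priorities) can be sketched with a small User Mode model. The struct area_sketch type, the pick_area( ) function, and the "last used index" argument are hypothetical simplifications; the kernel keeps its own data structures for this purpose.

```c
#include <assert.h>

/* Simplified model of a swap area (illustrative only). */
struct area_sketch {
    int active;                /* nonzero if the area is in use */
    int prio;                  /* higher value = preferred */
    unsigned int free_slots;   /* page slots still available */
};

/* Returns the index of the area to allocate from, or -1 if every area is
 * full. *last holds the index used by the previous call (-1 initially), so
 * that areas sharing the best priority are visited cyclically. */
int pick_area(const struct area_sketch *areas, int n, int *last)
{
    int best_prio = 0, found = 0, i;

    /* Pass 1: find the highest priority among areas with a free slot. */
    for (i = 0; i < n; i++) {
        if (!areas[i].active || areas[i].free_slots == 0)
            continue;
        if (!found || areas[i].prio > best_prio) {
            best_prio = areas[i].prio;
            found = 1;
        }
    }
    if (!found)
        return -1;

    /* Pass 2: starting just after *last, take the first such area. */
    for (i = 1; i <= n; i++) {
        int idx = (*last + i) % n;

        if (areas[idx].active && areas[idx].free_slots > 0 &&
            areas[idx].prio == best_prio) {
            *last = idx;
            return idx;
        }
    }
    return -1;
}
```

With two areas of priority 5 and one of priority 1, successive calls alternate between the first two until both are full, and only then select the third.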

16.1.3 How to Select the Page to Be Swapped Out

When choosing pages for swap out, it would be nice to be able to rank them according to some criterion. Several Least Recently Used (LRU) replacement algorithms have been proposed and used in some kernels. The main idea is to associate a counter storing the age of the page with each page in RAM, that is, the interval of time elapsed since the last access to the page. The oldest page of the process can then be swapped out.

Some computer platforms provide sophisticated support for LRU algorithms; for instance, the CPUs of some mainframes automatically update the value of a counter included in each Page Table entry to specify the age of the corresponding page. But 80x86 processors do not offer such a hardware feature, so Linux cannot use a true LRU algorithm. However, when selecting a candidate for swap out, Linux takes advantage of the Accessed flag included in each Page Table entry, which is automatically set by the hardware when the page is accessed. As we'll see later, this flag is set and cleared in a rather simplistic way to keep pages from being swapped in and out too much.
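One simple way to exploit an Accessed flag in place of a true age counter is to test and clear it on each scan: a page that was touched since the previous scan gets another chance, while an untouched page becomes a candidate for swap out. The sketch below illustrates this general idea only; the function names are ours, the flag position (bit 5) is the one used by 80x86 Page Table entries, and the policy shown is an assumption rather than the exact procedure described later in the chapter.

```c
#include <assert.h>

#define SKETCH_PTE_ACCESSED 0x20UL   /* Accessed flag: bit 5 on 80x86 */

/* Returns the old value of the Accessed flag and clears it. */
int test_and_clear_accessed(unsigned long *pte)
{
    int was_set;

    assert(pte != 0);
    was_set = (*pte & SKETCH_PTE_ACCESSED) != 0;
    *pte &= ~SKETCH_PTE_ACCESSED;
    return was_set;
}

/* Returns 1 if the page was not accessed since the last scan, and is thus
 * a candidate for swap out; otherwise clears the flag and returns 0. */
int is_swap_out_candidate(unsigned long *pte)
{
    return !test_and_clear_accessed(pte);
}
```

A page must therefore go unreferenced for a whole scan interval before it is selected, which crudely approximates "least recently used."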

16.1.4 When to Perform Page Swap Out

Swapping out is useful when the kernel is dangerously low on memory. As the kernel's first defense against critically low memory, it keeps a small reserve of free page frames that can be used only by the most critical functions. This turns out to be essential to avoid system crashes, which might occur when a kernel routine invoked to free resources is unable to obtain the memory area it needs to complete its task. To protect this reserve of free page frames, Linux may perform a swap out on the following occasions:

● By a kernel thread denoted as kswapd that is activated periodically whenever the number of free page frames falls below a predefined threshold.

● When a memory request to the buddy system (see Section 7.1.7) cannot be satisfied because the number of free page frames would fall below a predefined threshold.


16.2 Swap Area

The pages swapped out from memory are stored in a swap area, which may be implemented either as a disk partition of its own or as a file included in a larger partition. Several different swap areas may be defined, up to a maximum number specified by the MAX_SWAPFILES macro (usually set to 32).

Having multiple swap areas allows a system administrator to spread a lot of swap space among several disks so that the hardware can act on them concurrently; it also lets swap space be increased at runtime without rebooting the system.

Each swap area consists of a sequence of page slots: 4,096-byte blocks used to contain a swapped-out page. The first page slot of a swap area is used to persistently store some information about the swap area; its format is described by the swap_header union composed of two structures, info and magic. The magic structure provides a string that marks part of the disk unambiguously as a swap area; it consists of just one field, magic.magic, which contains a 10-character "magic" string. The magic structure essentially allows the kernel to unambiguously identify a file or a partition as a swap area; the text of the string depends on the swapping algorithm version: SWAP-SPACE for Version 1 or SWAPSPACE2 for Version 2. The field is always located at the end of the first page slot.

The info structure includes the following fields:

info.bootbits
    Not used by the swapping algorithm; this field corresponds to the first 1,024 bytes of the swap area, which may store partition data, disk labels, and so on.

info.version
    Swapping algorithm version.

info.last_page
    Last page slot that is effectively usable.

info.nr_badpages
    Number of defective page slots.

info.padding[125]
    Padding bytes.

info.badpages[1]
    Up to 637 numbers specifying the location of defective page slots.
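The layout of the first page slot can be sketched as a C union. The field types below (in particular, 32-bit unsigned integers for the numeric fields and for each padding element) are assumptions chosen because they are consistent with the figure of 637 defective-slot numbers given above; swap_header_version( ) and SKETCH_MAX_BADPAGES are names introduced only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096

union swap_header {
    struct {
        char reserved[SKETCH_PAGE_SIZE - 10];
        char magic[10];               /* last 10 bytes of the page slot */
    } magic;
    struct {
        char bootbits[1024];          /* not used by the swapping algorithm */
        unsigned int version;
        unsigned int last_page;
        unsigned int nr_badpages;
        unsigned int padding[125];
        unsigned int badpages[1];     /* really extends up to the magic string */
    } info;
};

/* Room between the start of badpages and the magic string, in entries:
 * (4,086 - 1,536) / 4 = 637 with 4-byte unsigned integers. */
#define SKETCH_MAX_BADPAGES \
    ((offsetof(union swap_header, magic.magic) - \
      offsetof(union swap_header, info.badpages)) / sizeof(unsigned int))

/* Returns 1 or 2 according to the magic string, or 0 if the page slot
 * does not look like a swap area header. */
int swap_header_version(const union swap_header *h)
{
    assert(h != NULL);
    if (memcmp(h->magic.magic, "SWAP-SPACE", 10) == 0)
        return 1;
    if (memcmp(h->magic.magic, "SWAPSPACE2", 10) == 0)
        return 2;
    return 0;
}
```

Declaring badpages with a single element and letting it run up to the magic string is a common C idiom for a variable-length tail; the two structures overlap on purpose, since they describe the same 4 KB page slot.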

The data stored in a swap area is meaningful as long as the system is on. When the system is switched off, all processes are killed, so all data stored by processes in swap areas is discarded. For this reason, swap areas contain very little control information; essentially, the swap area type and the list of defective page slots. This control information easily fits in a single 4 KB page.

Usually, the system administrator creates a swap partition when creating the other partitions on the Linux system, and then uses the mkswap command to set up the disk area as a new swap area. That command initializes the fields just described within the first page slot. Since the disk may include some bad blocks, the program also examines all other page slots to locate the defective ones. But executing the mkswap command leaves the swap area in an inactive state. Each swap area can be activated in a script file at system boot or dynamically after the system is running. An initialized swap area is considered active when it effectively represents an extension of the system RAM (see Section 16.2.3 later in this chapter).

16.2.1 Swap Area Descriptor

Each active swap area has its own swap_info_struct descriptor in memory. The fields of the descriptor are illustrated in Table 16-1.

Table 16-1. Fields of a swap area descriptor

Type                 Field          Description
unsigned int         flags          Swap area flags
kdev_t               swap_device    Device number of the swap disk partition
spinlock_t           sdev_lock      Swap area descriptor spin lock
struct dentry *      swap_file      Dentry of the file or device file
struct vfsmount *    swap_vfsmnt    Mounted filesystem descriptor of the file or device file
unsigned short *     swap_map       Pointer to array of counters, one for each swap area page slot
unsigned int         lowest_bit     First page slot to be scanned when searching for a free one
unsigned int         highest_bit    Last page slot to be scanned when searching for a free one
unsigned int         cluster_next   Next page slot to be scanned when searching for a free one
unsigned int         cluster_nr     Number of free page slot allocations before restarting from the beginning
int                  prio           Swap area priority
int                  pages          Number of usable page slots
unsigned long        max            Size of swap area in pages
int                  next           Pointer to next swap area descriptor

The flags field includes two overlapping subfields:

SWP_USED
    1 if the swap area is active; 0 if it is nonactive.

SWP_WRITEOK
    This 2-bit field is set to 3 if it is possible to write into the swap area and to 0 otherwise; since the least-significant bit of this field coincides with the bit used to implement SWP_USED, a swap area can be written only if it is active. The kernel is not allowed to write in a swap area when it is being activated or deactivated.

The swap_map field points to an array of counters, one for each swap area page slot. If the counter is equal to 0, the page slot is free; if it is positive, the page slot is filled with a swapped-out page (the exact meaning of positive values is discussed later in Section 16.3). If the counter has the value SWAP_MAP_MAX (equal to 32,767), the page stored in the page slot is "permanent" and cannot be removed from the corresponding slot. If the counter has the value SWAP_MAP_BAD (equal to 32,768), the page slot is considered defective, and thus unusable.[1]

[1] "Permanent" page slots protect against overflows of swap_map counters. Without them, valid page slots could become "defective" if they are referenced too many times, thus leading to data losses. However, no one really expects that a page slot counter could reach the value 32,768. It's just a "belt and suspenders" approach.

The prio field is a signed integer that denotes the order in which the swap subsystem should consider each swap area. Swap areas implemented on faster disks should have a higher priority so they will be used first. Only when they are filled does the swapping algorithm consider lower-priority swap areas. Swap areas that have the same priority are cyclically selected to distribute swapped-out pages among them. As we shall see in Section 16.2.3, the priority is assigned when the swap area is activated.

The sdev_lock field is a spin lock that protects the descriptor against concurrent accesses in SMP systems.

The swap_info array includes MAX_SWAPFILES swap area descriptors. Of course, not all of them are necessarily used, only those having the SWP_USED flag set. Figure 16-1 illustrates the swap_info array, one swap area, and the corresponding array of counters.

Figure 16-1. Swap area data structures
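The encodings of the flags field and of the swap_map counters described above can be sketched in a few lines of C. The constant values are the ones stated in the text; the helper names (area_is_active, area_is_writable, classify_slot) and the slot_state enumeration are hypothetical and exist only for this illustration.

```c
#include <stdbool.h>

/* Values as stated in the text. */
#define SWP_USED      1       /* bit 0: the swap area is active */
#define SWP_WRITEOK   3       /* bits 0-1: the swap area is active and writable */
#define SWAP_MAP_MAX  0x7fff  /* 32,767: "permanent" page slot */
#define SWAP_MAP_BAD  0x8000  /* 32,768: defective page slot */

/* Hypothetical names, used only in this sketch. */
enum slot_state { SLOT_FREE, SLOT_IN_USE, SLOT_PERMANENT, SLOT_BAD };

static bool area_is_active(unsigned int flags)
{
    return (flags & SWP_USED) != 0;
}

static bool area_is_writable(unsigned int flags)
{
    /* Both bits must be set, so a writable area is necessarily active. */
    return (flags & SWP_WRITEOK) == SWP_WRITEOK;
}

static enum slot_state classify_slot(unsigned short counter)
{
    if (counter == 0)
        return SLOT_FREE;
    if (counter == SWAP_MAP_BAD)
        return SLOT_BAD;
    if (counter == SWAP_MAP_MAX)
        return SLOT_PERMANENT;
    return SLOT_IN_USE;
}
```

Because SWP_WRITEOK overlaps SWP_USED, a descriptor whose flags field is just SWP_USED is active but not writable, which is the state the text associates with activation and deactivation.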

The nr_swapfiles variable stores the index of the last array element that contains, or that has contained, a used swap area descriptor. Despite its name, the variable does not contain the number of active swap areas.

Descriptors of active swap areas are also inserted into a list sorted by the swap area priority. The list is implemented through the next field of the swap area descriptor, which stores the index of the next descriptor in the swap_info array. This use of the field as an index is different from most fields with the name next, which are usually pointers.

The swap_list variable, of type swap_list_t, includes the following fields:

head
    Index in the swap_info array of the first list element.

next
    Index in the swap_info array of the descriptor of the next swap area to be selected for swapping out pages. This field is used to implement a round-robin algorithm among maximum-priority swap areas with free slots.

The swaplock spin lock protects the list against concurrent accesses in multiprocessor systems.

The max field of the swap area descriptor stores the size of the swap area in pages, while the pages field stores the number of usable page slots. These numbers differ because pages does not take the first page slot and the defective page slots into consideration.

Finally, the nr_swap_pages variable contains the number of available (free and nondefective) page slots in all active swap areas, while the total_swap_pages variable contains the total number of nondefective page slots.
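To make the index-based list concrete, here is a minimal sketch of how a descriptor could be inserted into a list sorted by decreasing priority when the links are array indices rather than pointers. The reduced struct area, the areas array, the list variable, and list_insert are hypothetical stand-ins for the swap area descriptor, swap_info, swap_list, and the kernel's insertion code; a negative index is assumed to terminate the list.

```c
#define MAX_AREAS 8          /* stand-in for MAX_SWAPFILES */

struct area {                /* hypothetical, reduced descriptor */
    int prio;                /* swap area priority */
    int next;                /* index of the next descriptor; -1 ends the list */
};

struct area_list {           /* plays the role of swap_list */
    int head;                /* index of the first list element */
    int next;                /* index of the next area to be used */
};

static struct area areas[MAX_AREAS];
static struct area_list list = { -1, -1 };

/* Insert areas[idx] so the list stays sorted by decreasing priority. */
static void list_insert(int idx)
{
    int prev = -1;
    int i;

    /* Walk the list until an element with lower or equal priority is found. */
    for (i = list.head; i >= 0; i = areas[i].next) {
        if (areas[idx].prio >= areas[i].prio)
            break;
        prev = i;
    }
    areas[idx].next = i;
    if (prev < 0)
        list.head = list.next = idx;   /* new highest-priority area */
    else
        areas[prev].next = idx;
}
```

Inserting areas with priorities 1, 5, and 3 (in that order, at indices 0, 1, and 2) yields the chain 1, 2, 0: the head is the priority-5 area, and each next field holds the index of the following descriptor.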

16.2.2 Swapped-Out Page Identifier

A swapped-out page is uniquely identified quite simply by specifying the index of the swap area in the swap_info array and the page slot index inside the swap area. Since the first page (with index 0) of the swap area is reserved for the swap_header union discussed earlier, the first useful page slot has index 1. The format of a swapped-out page identifier is illustrated in Figure 16-2.

Figure 16-2. Swapped-out page identifier

The SWP_ENTRY(type,offset) macro constructs a swapped-out page identifier from the swap area index type and the page slot index offset. Conversely, the SWP_TYPE and SWP_OFFSET macros extract from a swapped-out page identifier the swap area index and the page slot index, respectively.

When a page is swapped out, its identifier is inserted as the page's entry into the Page Table so the page can be found again when needed. Notice that the least-significant bit of such an identifier, which corresponds to the Present flag, is always cleared to denote the fact that the page is not currently in RAM. However, at least one of the remaining 31 bits has to be set because no page is ever stored in slot 0 of swap area 0. It is therefore possible to identify three different cases from the value of a Page Table entry:

Null entry
    The page does not belong to the process address space.

First 31 most-significant bits not all equal to 0, last bit equal to 0
    The page is currently swapped out.

Least-significant bit equal to 1
    The page is contained in RAM.

Notice that the maximum size of a swap area is determined by the number of bits available to identify a slot. On the 80x86 architecture, the 24 bits available limit the size of a swap area to 2^24 slots (that is, with 4 KB page slots, to 64 GB).

Since a page may belong to the address spaces of several processes (see the later section Section 16.3), it may be swapped out from the address space of one process and still remain in main memory; therefore, it is possible to swap out the same page several times. A page is physically swapped out and stored just once, of course, but each subsequent attempt to swap it out increments the swap_map counter.
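The identifier layout and the three Page Table entry cases described earlier in this section can be sketched as follows. Figure 16-2 is not reproduced here, so the exact bit positions are an assumption of this sketch: bit 0 is the cleared Present flag, the swap area index sits in the bits just above it, and the page slot index occupies the upper 24 bits. All names carrying the sketch prefix, as well as classify_pte and max_swap_area_bytes, are hypothetical.

```c
#include <stdint.h>

/* Hypothetical 32-bit identifier; the bit layout is an assumption of this sketch. */
typedef struct { uint32_t val; } swp_entry_sketch_t;

#define SKETCH_TYPE_SHIFT    1      /* area index starts just above the Present bit */
#define SKETCH_TYPE_MASK     0x3fu  /* assumed width of the area index */
#define SKETCH_OFFSET_SHIFT  8      /* page slot index uses the upper 24 bits */

static swp_entry_sketch_t sketch_entry(uint32_t type, uint32_t offset)
{
    swp_entry_sketch_t e = {
        (type << SKETCH_TYPE_SHIFT) | (offset << SKETCH_OFFSET_SHIFT)
    };
    return e;
}

static uint32_t sketch_type(swp_entry_sketch_t e)
{
    return (e.val >> SKETCH_TYPE_SHIFT) & SKETCH_TYPE_MASK;
}

static uint32_t sketch_offset(swp_entry_sketch_t e)
{
    return e.val >> SKETCH_OFFSET_SHIFT;
}

enum pte_case { PTE_NULL, PTE_SWAPPED_OUT, PTE_IN_RAM };

/* The three cases that can be told apart from a Page Table entry value. */
static enum pte_case classify_pte(uint32_t pte)
{
    if (pte & 1u)
        return PTE_IN_RAM;       /* Present flag set */
    if (pte == 0)
        return PTE_NULL;         /* page not in the address space */
    return PTE_SWAPPED_OUT;      /* Present clear, some other bit set */
}

/* 2^24 page slots of 4 KB each. */
static uint64_t max_swap_area_bytes(void)
{
    return ((uint64_t)1 << 24) * 4096u;
}
```

Since slot 0 of every area holds the swap_header union, a valid identifier always has a nonzero offset, which is why a swapped-out entry can never be confused with a null entry.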

The swap_duplicate( ) function is usually invoked while trying to swap out an already swapped-out page. It just verifies that the swapped-out page identifier passed as its parameter is valid and increments the corresponding swap_map counter. More precisely, it performs the following actions:

1. Uses the SWP_TYPE and SWP_OFFSET macros to extract the swap area number type and the page slot index offset from the parameter.

2. Checks whether the swap area is activated; if not, it returns 0 (invalid identifier).

3. Checks whether the page slot is valid and not free (its swap_map counter is greater than 0 and less than SWAP_MAP_BAD); if not, returns 0 (invalid identifier).

4. Otherwise, the swapped-out page identifier locates a valid page. Increments the swap_map counter of the page slot if it has not already reached the value SWAP_MAP_MAX.

5. Returns 1 (valid identifier).
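The five steps above can be sketched as a small user-space function. This is not the kernel's code: the reduced struct area_sketch and the function duplicate_sketch are hypothetical, the area index and slot index are passed in already decoded (Step 1), the bounds checks are an added safety assumption, and all locking is omitted.

```c
#define SWP_USED      1
#define SWAP_MAP_MAX  0x7fff
#define SWAP_MAP_BAD  0x8000

struct area_sketch {               /* hypothetical, reduced descriptor */
    unsigned int    flags;
    unsigned long   max;           /* number of page slots in the area */
    unsigned short *swap_map;      /* one counter per page slot */
};

/* Returns 1 if (type, offset) names a valid, in-use page slot, 0 otherwise. */
static int duplicate_sketch(struct area_sketch *areas, unsigned int nr_areas,
                            unsigned int type, unsigned long offset)
{
    struct area_sketch *p;

    if (type >= nr_areas)
        return 0;                          /* no such descriptor */
    p = &areas[type];
    if (!(p->flags & SWP_USED))
        return 0;                          /* Step 2: area not activated */
    if (offset >= p->max)
        return 0;                          /* slot index out of range */
    if (p->swap_map[offset] == 0 || p->swap_map[offset] >= SWAP_MAP_BAD)
        return 0;                          /* Step 3: free or defective slot */
    if (p->swap_map[offset] < SWAP_MAP_MAX)
        p->swap_map[offset]++;             /* Step 4: one more reference */
    return 1;                              /* Step 5 */
}
```

A counter that has already reached SWAP_MAP_MAX is left untouched, which is what keeps a heavily shared slot from overflowing into the SWAP_MAP_BAD value.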

16.2.3 Activating and Deactivating a Swap Area

Once a swap area is initialized, the superuser (or, more precisely, any user having the CAP_SYS_ADMIN capability, as described in Section 20.1.1) may use the swapon and swapoff programs to activate and deactivate the swap area, respectively. These programs use the swapon( ) and swapoff( ) system calls; we'll briefly sketch out the corresponding service routines.

16.2.3.1 The sys_swapon( ) service routine

The sys_swapon( ) service routine receives the following as parameters:

specialfile
    This parameter points to the pathname (in the User Mode address space) of the device file (partition) or plain file used to implement the swap area.

swap_flags
    This parameter consists of a single SWAP_FLAG_PREFER bit plus 15 bits of priority of the swap area (these bits are significant only if the SWAP_FLAG_PREFER bit is on).
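As an illustration of the swap_flags layout, the sketch below packs and unpacks a priority. The text only states that there is one SWAP_FLAG_PREFER bit and 15 priority bits; placing the flag in bit 15 and the priority in the low 15 bits is an assumption of this sketch, and the names with the SKETCH prefix, make_swap_flags, flags_have_priority, and flags_priority are hypothetical.

```c
/* Assumed layout: bit 15 is the "prefer" bit, bits 0-14 hold the priority. */
#define SKETCH_FLAG_PREFER     0x8000
#define SKETCH_FLAG_PRIO_MASK  0x7fff

/* Build a swap_flags value that requests an explicit priority. */
static int make_swap_flags(int prio)
{
    return SKETCH_FLAG_PREFER | (prio & SKETCH_FLAG_PRIO_MASK);
}

/* The priority bits are meaningful only when the prefer bit is on. */
static int flags_have_priority(int swap_flags)
{
    return (swap_flags & SKETCH_FLAG_PREFER) != 0;
}

static int flags_priority(int swap_flags)
{
    return swap_flags & SKETCH_FLAG_PRIO_MASK;
}
```

With this layout, a caller that passes 0 as swap_flags leaves the choice of priority to the kernel.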

The function checks the fields of the swap_header union that was put in the first slot when the swap area was created. The function performs these main steps:

1. Checks that the current process has the CAP_SYS_ADMIN capability.

2. Searches for the first element in the swap_info array of swap area descriptors that has the SWP_USED flag cleared, meaning that the corresponding swap area is inactive. If there is none, there are already MAX_SWAPFILES active swap areas, so the function returns an error code.

3. A descriptor for the swap area has been found. The function initializes the descriptor's fields (setting flags to SWP_USED, setting lowest_bit and highest_bit to 0, and so on). Moreover, if the descriptor's index is greater than nr_swapfiles, the function updates that variable.

4. If the swap_flags parameter specifies a priority for the new swap area, the function sets the prio field of the descriptor. Otherwise, it initializes the field with the lowest priority among all active swap areas minus 1 (thus assuming that the last activated swap area is on the slowest block device). If no other swap areas are already active, the function assigns the value -1.

5. Copies the string pointed to by the specialfile parameter from the User Mode address space.

6. Invokes path_init( ) and path_walk( ) to perform a pathname lookup on the string copied from the User Mode address space (see Section 12.5).

7. Stores the addresses of the dentry object and of the mounted filesystem descriptor returned by path_walk( ) in the swap_file and swap_vfsmnt fields of the swap area descriptor, respectively.

8. If the specialfile parameter identifies a block device file, the function performs the following substeps:

   a. Stores the device number in the swap_device field of the descriptor.

   b. Sets the block size of the device to 4 KB; that is, sets its blksize_size entry to PAGE_SIZE.

   c. Initializes the block device driver by invoking the bd_acquire( ) and do_open( ) functions, described in Section 13.4.2.

9. Checks to make sure that the swap area was not already activated by looking at the address_space objects of the other active swap areas in swap_info (given an address q of a swap area descriptor, the corresponding address_space object is obtained by q->swap_file->d_inode->i_mapping). If the swap area is already active, it returns an error code.

10. Allocates a page frame and invokes rw_swap_page_nolock( ) (see Section 16.4 later in this chapter) to fill it with the swap_header union stored in the first page of the swap area.

11. Checks that the magic string in the last ten characters of the first page in the swap area is equal to SWAP-SPACE or to SWAPSPACE2 (there are two slightly different versions of the swapping algorithm). If not, the specialfile parameter does not specify an already initialized swap area, so the function returns an error code. For the sake of brevity, we'll suppose that the swap area has the SWAPSPACE2 magic string.

12. Initializes the lowest_bit and highest_bit fields of the swap area descriptor according to the size of the swap area stored in the info.last_page field of the swap_header union.

13. Invokes vmalloc( ) to create the array of counters associated with the new swap area and store its address in the swap_map field of the swap descriptor. Initializes the elements of the array to 0 or to SWAP_MAP_BAD, according to the list of defective page slots stored in the info.badpages field of the swap_header union.

14. Computes the number of useful page slots by accessing the info.last_page and info.nr_badpages fields in the first page slot.

15. Sets the flags field of the swap descriptor to SWP_WRITEOK, sets the pages field to the number of useful page slots, and updates the nr_swap_pages and total_swap_pages variables.

16. Inserts the new swap area descriptor in the list to which the swap_list variable points.

17. Releases the page frame that contains the data of the first page of the swap area and returns 0 (success).
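Steps 11, 13, and 14 above lend themselves to a small user-space sketch: recognizing the magic string at the end of the first page, building the array of counters, and counting the usable page slots. Everything here is an illustration under stated assumptions: calloc( ) stands in for vmalloc( ), the page size is taken to be 4 KB, slot 0 is marked as unusable because it holds the swap_header union, and the names swap_magic_version, build_swap_map, and PAGE_SIZE_SKETCH are hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE_SKETCH 4096     /* assumed page size */
#define SWAP_MAP_BAD     0x8000

/* Returns 1 for "SWAP-SPACE", 2 for "SWAPSPACE2", 0 for an uninitialized area. */
static int swap_magic_version(const unsigned char *first_page)
{
    /* The magic string occupies the last ten characters of the first page. */
    const unsigned char *magic = first_page + PAGE_SIZE_SKETCH - 10;

    if (memcmp(magic, "SWAP-SPACE", 10) == 0)
        return 1;
    if (memcmp(magic, "SWAPSPACE2", 10) == 0)
        return 2;
    return 0;
}

/*
 * Allocates one counter per page slot, marks the header slot and the listed
 * defective slots as SWAP_MAP_BAD, and stores the number of usable slots in
 * *usable. Returns NULL on allocation failure or on an invalid bad-slot index.
 */
static unsigned short *build_swap_map(unsigned int nr_slots,
                                      const unsigned int *badpages,
                                      unsigned int nr_badpages,
                                      unsigned int *usable)
{
    unsigned short *map;
    unsigned int i, good = 0;

    if (nr_slots == 0)
        return NULL;
    map = calloc(nr_slots, sizeof(*map));   /* every slot starts out free */
    if (map == NULL)
        return NULL;

    map[0] = SWAP_MAP_BAD;                  /* slot 0 holds the swap header */
    for (i = 0; i < nr_badpages; i++) {
        if (badpages[i] == 0 || badpages[i] >= nr_slots) {
            free(map);
            return NULL;                    /* corrupted list of bad slots */
        }
        map[badpages[i]] = SWAP_MAP_BAD;
    }
    for (i = 0; i < nr_slots; i++)
        if (map[i] == 0)
            good++;
    *usable = good;
    return map;
}
```

For an area of 10 slots with defective slots 3 and 7, this yields 7 usable slots: the header slot and the two defective ones are excluded, matching the distinction the text draws between the max and pages fields.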

16.2.3.2 The sys_swapoff( ) service routine

The sys_swapoff( ) service routine deactivates a swap area identified by the parameter specialfile. It is much more complex and time-consuming than sys_swapon( ), since the partition to be deactivated might still contain pages that belong to several processes. The function is thus forced to scan the swap area and to swap in all existing pages. Since each swap in requires a new page frame, it might fail if there are no free page frames left. In this case, the function returns an error code. All this is achieved by performing the following major steps:

1. Checks that the current process has the CAP_SYS_ADMIN capability.

2. Copies the string pointed to by specialfile, and invokes path_init( ) and path_walk( ) to perform a pathname lookup.

3. Scans the list to which swap_list points and locates the descriptor whose swap_file field points to the dentry object found by the pathname lookup. If no such descriptor exists, an invalid parameter was passed to the function, so it returns an error code.

4. Otherwise, if the descriptor exists, checks that its SWP_WRITEOK flag is set; if not, returns an error code because the swap area is already being deactivated by another process.

5. Removes the descriptor from the list and sets its flags field to SWP_USED so the kernel doesn't store more pages in the swap area before this function deactivates it.

6. Subtracts the swap area size stored in the pages field of the swap area descriptor from the values of nr_swap_pages and total_swap_pages.

7. Invokes the try_to_unuse( ) function (see below) to successively force all pages left in the swap area into RAM and to correspondingly update the Page Tables of the processes that use these pages.

8. If try_to_unuse( ) fails in allocating all requested page frames, the swap area cannot be deactivated. Therefore, the function executes the following substeps:

   a. Reinserts the swap area descriptor in the swap_list list and sets its flags field to SWP_WRITEOK (see Step 5).

   b. Adds the content of the pages field to the nr_swap_pages and total_swap_pages variables (see Step 6).

   c. Invokes path_release( ) to release the VFS objects allocated by path_walk( ) in Step 2.

   d. Finally, returns an error code.

9. Otherwise, all used page slots have been successfully transferred to RAM. Therefore, the function executes the following substeps:

   a. If specialfile identifies a block device file, releases the corresponding block device driver.

   b. Invokes path_release( ) to release the VFS objects allocated by path_walk( ) in Step 2.

   c. Releases the memory area used to store the swap_map array.

   d. Invokes path_release( ) again because the VFS objects that refer to specialfile have been allocated by the path_walk( ) function invoked by sys_swapon( ) (see Step 6 in the previous section).

   e. Returns 0 (success).
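The state changes in Steps 4 through 9, including the rollback performed when the pages cannot all be brought back into RAM, can be sketched as follows. This is a simplified model, not the kernel routine: the pathname lookup, the list manipulation, and all locking are left out, the callback stands in for try_to_unuse( ), clearing the flags field at the end is an assumption about what "deactivated" means, and every name with the sketch suffix (as well as unuse_ok and unuse_fails) is hypothetical.

```c
#define SWP_USED     1
#define SWP_WRITEOK  3

struct area_sketch {                 /* hypothetical, reduced descriptor */
    unsigned int flags;
    int          pages;              /* number of usable page slots */
};

static int nr_swap_pages_sketch;     /* stand-in for nr_swap_pages */
static int total_swap_pages_sketch;  /* stand-in for total_swap_pages */

/* Stand-ins for try_to_unuse(): 0 means every page was swapped in. */
static int unuse_ok(struct area_sketch *p)    { (void)p; return 0; }
static int unuse_fails(struct area_sketch *p) { (void)p; return -1; }

static int swapoff_sketch(struct area_sketch *p,
                          int (*unuse)(struct area_sketch *))
{
    /* Step 4: the area must be active and not already being deactivated. */
    if ((p->flags & SWP_WRITEOK) != SWP_WRITEOK)
        return -1;

    /* Steps 5 and 6: stop new swap outs and remove the area from the totals. */
    p->flags = SWP_USED;
    nr_swap_pages_sketch -= p->pages;
    total_swap_pages_sketch -= p->pages;

    /* Step 7: force every page left in the area back into RAM. */
    if (unuse(p) != 0) {
        /* Step 8: undo Steps 5 and 6 and report the failure. */
        p->flags = SWP_WRITEOK;
        nr_swap_pages_sketch += p->pages;
        total_swap_pages_sketch += p->pages;
        return -2;
    }

    /* Step 9: the area is now unused (assumption: flags are cleared). */
    p->flags = 0;
    return 0;
}
```

The intermediate SWP_USED state is what lets a second, concurrent deactivation attempt be rejected in Step 4 while the first one is still in progress.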

16.2.3.3 The try_to_unuse( ) function

As stated previously, the try_to_unuse( ) function swaps in pages and updates all the Page Tables of processes that have swapped out pages. To that end, the function visits the address spaces of all kernel threads and processes, starting with the init_mm memory descriptor that is used as a marker. It is a time-consuming function that runs mostly with the interrupts enabled. Synchronization with other processes is therefore critical.

The try_to_unuse( ) function scans the swap_map array of the swap area. When the function finds an in-use page slot, it first swaps in the page, and then starts looking for the processes that reference the page. The ordering of these two operations is crucial to avoid race conditions. While the I/O data transfer is ongoing, the page is locked, so no process can access it. Once the I/O data transfer completes, the page is locked again by try_to_unuse( ), so it cannot be swapped out again by another kernel control path. Race conditions are also avoided because each process looks up the page cache before starting a swap in or swap out operation (see the later section Section 16.3). Finally, the swap area considered by try_to_unuse( ) is marked as nonwritable (the SWP_WRITEOK flag is not set), so no process can perform a swap out on a page slot of this area.

However, try_to_unuse( ) might be forced to scan the swap_map array of usage counters of the swap area several times. This is because memory regions that contain references to swapped-out pages might disappear during one scan and later reappear in the process lists. For instance, recall the description of the do_munmap( ) function (in Section 8.3.5): whenever a process releases an interval of linear addresses, do_munmap( ) removes from the process list all memory regions that include the affected linear addresses; later, the function reinserts the memory regions that have been only partially unmapped in the process list. do_munmap( ) takes care of freeing the swapped-out pages that belong to the interval of released linear addresses; however, it commendably doesn't free the swapped-out pages that belong to the memory regions that have to be reinserted in the process list.

Hence, try_to_unuse( ) might fail in finding a process that references a given page slot because the corresponding memory region is temporarily not included in the process list. To cope with this fact, try_to_unuse( ) keeps scanning the swap_map array until all reference counters are null. Eventually, the ghost memory regions referencing the swapped-out pages will reappear in the process lists, so try_to_unuse( ) will succeed in freeing all page slots.

Let's describe now the major operations executed by try_to_unuse( ). It executes a continuous loop on the reference counters in the swap_map array of the swap area passed as its parameter. For each reference counter, the function performs the following steps:

1. If the counter is equal to 0 (no page is stored there) or to SWAP_MAP_BAD, it continues with the next page slot.

2. Otherwise, it invokes the read_swap_cache_async( ) function (see Section 16.4 later in this chapter) to swap in the page. This consists of allocating, if necessary, a new page frame, filling it with the data stored in the page slot, and putting the page in the swap cache.

3. Waits until the new page has been properly updated from disk and locks it.

4. While the function was executing the previous step, the process could have been suspended. Therefore, it checks again whether the reference counter of the page slot is null; if so, it continues with the next page slot (this swap page has been freed by another kernel control path).

5. Invokes unuse_process( ) on every memory descriptor in the doubly linked list whose head is init_mm (see Section 8.2). This time-consuming function scans all Page Table entries of the process that owns the memory descriptor, and replaces each occurrence of the swapped-out page identifier with the physical address of the page frame. To reflect this move, the function also decrements the page slot counter in the swap_map array (unless it is equal to SWAP_MAP_MAX) and increments the usage counter of the page frame.

6. Invokes shmem_unuse( ) to check whether the swapped-out page is used for an IPC shared memory resource and to properly handle that case (see Section 19.3.5).

7. Checks the value of the reference counter of the page. If it is equal to SWAP_MAP_MAX, the page slot is "permanent." To free it, it forces the value 1 into the reference counter.

8. The swap cache might own the page as well (it contributes to the value of the reference counter). If the page belongs to the swap cache, it invokes the rw_swap_page( ) function to flush its contents on disk (if the page is dirty), invokes delete_from_swap_cache( ) to remove the page from the swap cache, and decrements its reference counter.

9. Sets the PG_dirty flag of the page descriptor and unlocks the page.

10. Checks the need_resched field of the current process; if it is set, it invokes schedule( ) to relinquish the CPU. Deactivating a swap area is a long job, and the kernel must ensure that the other processes in the system still continue to execute. The try_to_unuse( ) function continues from this step whenever the process is selected again by the scheduler.

11. Proceeds with the next page slot, starting at Step 1.

The function continues until every reference counter in the swap_map array is null. Recall that even if the function starts examining the next page slot, the reference counter of the previous page slot could still be positive. In fact, a "ghost" process could still reference the page, typically because some memory regions have been temporarily removed from the process list scanned in Step 5. Eventually, try_to_unuse( ) catches every reference. In the meantime, however, the page is no longer in the swap cache, it is unlocked, and a copy is still included in the page slot of the swap area being deactivated.

One might expect that this situation could lead to data loss. For instance, suppose that some "ghost" process accesses the page slot and starts swapping the page in. Since the page is no longer in the swap cache, the process fills a new page frame with the data read from disk. However, this page frame would be different from the page frames owned by the processes that are supposed to share the page with the "ghost" process.

This problem does not arise when deactivating a swap area because interference from a ghost process could happen only if a swapped-out page belongs to a private anonymous memory mapping.[2] In this case, the page frame is handled by means of the Copy on Write mechanism described in Chapter 8, so it is perfectly legal to assign different page frames to the processes that reference the page. However, the try_to_unuse( ) function marks the page as "dirty" (Step 9); otherwise, the try_to_swap_out( ) function might later drop the page from the Page Table of some process without saving it in another swap area (see the later section Section 16.5).

[2] Actually, the page might also belong to an IPC shared memory region; Chapter 19 has a discussion of this case.
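The reason try_to_unuse( ) may need several scans can be shown with a toy model of its outer loop. This sketch keeps only the counting logic: the swap-in, the swap cache, the "permanent" slots, and rescheduling are all omitted, and the callback stands in for the search through the process address spaces, reporting how many references it managed to find during a given pass. The names unuse_sketch and drop_one are hypothetical, and the model assumes every reference is eventually found.

```c
#define SWAP_MAP_BAD 0x8000

/*
 * Stand-in for the Page Table scan: pretends that only one reference per
 * slot is visible during each pass, as if the other referencing memory
 * regions were temporarily missing from the process lists.
 */
static unsigned int drop_one(unsigned long slot, unsigned int pass)
{
    (void)slot;
    (void)pass;
    return 1;
}

/* Scans swap_map repeatedly until every counter is null; returns the pass count. */
static unsigned int unuse_sketch(unsigned short *swap_map, unsigned long max,
                                 unsigned int (*drop)(unsigned long, unsigned int))
{
    unsigned int pass = 0;
    int busy;

    do {
        unsigned long i;

        busy = 0;
        pass++;
        for (i = 1; i < max; i++) {       /* slot 0 holds the swap header */
            unsigned int found;

            if (swap_map[i] == 0 || swap_map[i] == SWAP_MAP_BAD)
                continue;                 /* free or defective slot */
            found = drop(i, pass);
            if (found > swap_map[i])
                found = swap_map[i];
            swap_map[i] -= found;         /* one decrement per reference found */
            if (swap_map[i] != 0)
                busy = 1;                 /* a "ghost" reference remains */
        }
    } while (busy);

    return pass;
}
```

With counters 1 and 3 in two slots and the drop_one stand-in, the first slot is freed in the first pass while the second needs three passes, so the function returns 3.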

16.2.4 Allocating and Releasing a Page Slot

As we shall see later, when freeing memory, the kernel swaps out many pages in a short period of time. It is therefore important to try to store these pages in contiguous slots to minimize disk seek time when accessing the swap area.

A first approach to an algorithm that searches for a free slot could choose one of two simplistic, rather extreme strategies:

● Always start from the beginning of the swap area. This approach may increase the average seek time during swap-out operations because free page slots may be scattered far away from one another.

● Always start from the last allocated page slot. This approach increases the average seek time during swap-in operations if the swap area is mostly free (as is usually the case) because the handful of occupied page slots may be scattered far away from one another.

Linux adopts a hybrid approach. It always starts from the last allocated page slot unless one of these conditions occurs:

● The end of the swap area is reached

● SWAPFILE_CLUSTER (usually 256) free page slots were allocated after the last restart from the beginning of the swap area

The cluster_nr field in the swap_info_struct descriptor stores the number of free page slots allocated. This field is reset to 0 when the function restarts allocation from the beginning of the swap area. The cluster_next field stores the index of the first page slot to be examined in the next allocation.[3]

[3] As you may have noticed, the names of Linux data structures are not always appropriate. In this case, the kernel does not really "cluster" page slots of a swap area.

To speed up the search for free page slots, the kernel keeps the lowest_bit and highest_bit fields of each swap area descriptor up to date. These fields specify the first and the last page slots that could be free; in other words, any page slot below lowest_bit or above highest_bit is known to be occupied.

16.2.4.1 The scan_swap_map( ) function

The scan_swap_map( ) function is used to find a free page slot in a given swap area. It acts on a single parameter, which points to a swap area descriptor, and returns the index of a free page slot. It returns 0 if the swap area does not contain any free slots. The function performs the following steps:

1. It tries first to use the current cluster. If the cluster_nr field of the swap area descriptor is positive, it scans the swap_map array of counters starting from the element at index cluster_next and looks for a null entry. If a null entry is found, it decrements the cluster_nr field and goes to Step 4.

2. If this point is reached, either the cluster_nr field is null or the search starting from cluster_next didn't find a null entry in the swap_map array. It is time to try the second stage of the hybrid search. It reinitializes cluster_nr to SWAPFILE_CLUSTER and restarts scanning the array from the lowest_bit index, trying to find a group of SWAPFILE_CLUSTER free page slots. If such a group is found, it goes to Step 4.

3. No group of SWAPFILE_CLUSTER free page slots exists. It restarts scanning the array from the lowest_bit index, trying to find a single free page slot. If no null entry is found, it sets the lowest_bit field to the maximum index in the array, the highest_bit field to 0, and returns 0 (the swap area is full).

4. A null entry is found. Puts the value 1 in the entry, decrements nr_swap_pages, updates the lowest_bit and highest_bit fields if necessary, and sets the cluster_next field to the index of the page slot just allocated plus 1.

5. Returns the index of the allocated page slot.
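The five steps above can be sketched in user-space C. This is only an illustration under stated assumptions, not the kernel's code: struct swap_area is a reduced, hypothetical stand-in for swap_info_struct, the max field name and the claim_slot( ) helper are our own, SWAPFILE_CLUSTER is shrunk to 4 so the behavior is easy to trace, and no locking is shown. Slot 0 is assumed to be reserved, which is why a return value of 0 can signal a full area.

```c
#include <assert.h>

#define SWAPFILE_CLUSTER 4   /* usually 256; tiny here so the sketch is easy to trace */

/* Hypothetical, reduced stand-in for swap_info_struct. */
struct swap_area {
    unsigned short *swap_map;   /* one usage counter per page slot */
    unsigned int max;           /* number of entries in swap_map (assumed name) */
    unsigned int lowest_bit;
    unsigned int highest_bit;
    unsigned int cluster_next;
    unsigned int cluster_nr;
};

static long nr_swap_pages;      /* free page slots over all areas */

/* Steps 4 and 5: mark the slot as used and update the bookkeeping. */
static unsigned int claim_slot(struct swap_area *si, unsigned int offset)
{
    si->swap_map[offset] = 1;
    nr_swap_pages--;
    if (offset == si->lowest_bit)
        si->lowest_bit++;
    if (offset == si->highest_bit)
        si->highest_bit--;
    if (si->lowest_bit > si->highest_bit) {   /* the area is now full */
        si->lowest_bit = si->max;
        si->highest_bit = 0;
    }
    si->cluster_next = offset + 1;
    return offset;
}

/* Returns the index of a newly allocated slot, or 0 if the area is full. */
unsigned int scan_swap_map_sketch(struct swap_area *si)
{
    unsigned int offset, nr;

    assert(si->swap_map != 0);

    /* Step 1: keep allocating sequentially inside the current cluster. */
    if (si->cluster_nr > 0) {
        while (si->cluster_next <= si->highest_bit) {
            offset = si->cluster_next++;
            if (si->swap_map[offset] == 0) {
                si->cluster_nr--;
                return claim_slot(si, offset);
            }
        }
    }

    /* Step 2: look for SWAPFILE_CLUSTER consecutive free slots. */
    si->cluster_nr = SWAPFILE_CLUSTER;
    offset = si->lowest_bit;
    while (offset + SWAPFILE_CLUSTER - 1 <= si->highest_bit) {
        for (nr = offset; nr < offset + SWAPFILE_CLUSTER; nr++)
            if (si->swap_map[nr] != 0)
                break;
        if (nr == offset + SWAPFILE_CLUSTER)
            return claim_slot(si, offset);
        offset = nr + 1;        /* skip past the occupied slot */
    }

    /* Step 3: settle for any single free slot. */
    for (offset = si->lowest_bit; offset <= si->highest_bit; offset++)
        if (si->swap_map[offset] == 0)
            return claim_slot(si, offset);

    si->lowest_bit = si->max;
    si->highest_bit = 0;
    return 0;
}
```

With an eight-slot area whose slot 0 is reserved, successive calls hand out slots 1 through 7 in order and then return 0; the first call goes through Step 2, the next four are served by Step 1, and the tail of the area is filled through Steps 3 and 1.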

16.2.4.2 The get_swap_page( ) function

The get_swap_page( ) function is used to find a free page slot by searching all the active swap areas. The function, which returns the index of a newly allocated page slot or 0 if all swap areas are filled, takes into consideration the different priorities of the active swap areas.

Two passes are necessary. The first pass is partial and applies only to areas that have a single priority; the function searches such areas in a round-robin fashion for a free slot. If no free page slot is found, a second pass is made starting from the beginning of the swap area list; during this second pass, all swap areas are examined. More precisely, the function performs the following steps:

1. If nr_swap_pages is null or if there are no active swap areas, returns 0.

2. Starts by considering the swap area pointed to by swap_list.next (recall that the swap area list is sorted by decreasing priorities).

3. If the swap area is active and not being deactivated, invokes scan_swap_map( ) to allocate a free page slot. If scan_swap_map( ) returns a page slot index, the function's job is essentially done, but it must prepare for its next invocation. Thus, it updates swap_list.next to point to the next swap area in the swap area list, if the latter has the same priority (thus continuing the round-robin use of these swap areas). If the next swap area does not have the same priority as the current one, the function sets swap_list.next to the first swap area in the list (so that the next search will start with the swap areas that have the highest priority). The function finishes by returning the identifier corresponding to the page slot just allocated.

4. Either the swap area is not writable, or it does not have free page slots. If the next swap area in the swap area list has the same priority as the current one, the function makes it the current one and goes to Step 3.

5. At this point, the next swap area in the swap area list has a lower priority than the previous one. The next step depends on which of the two passes the function is performing.
   a. If this is the first (partial) pass, it considers the first swap area in the list and goes to Step 3, thus starting the second pass.
   b. Otherwise, it checks if there is a next element in the list; if so, it considers it and goes to Step 3.

6. At this point the list is completely scanned by the second pass and no free page slot has been found; it returns 0.
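The two-pass search can be made concrete with a small model. The sketch below is an illustration only: struct area, area_list, take_slot( ), and struct slot_id are hypothetical names introduced here, take_slot( ) merely stands in for scan_swap_map( ), the nr_swap_pages test of Step 1 is omitted, and the result is returned as an (area, offset) pair rather than as a packed swapped-out page identifier.

```c
#include <assert.h>

#define MAX_AREAS 4

/* Hypothetical, reduced swap area descriptor. */
struct area {
    int writable;               /* "active and not being deactivated" */
    int prio;
    int next;                   /* next area by decreasing priority, -1 at the end */
    unsigned int free_slots;
    unsigned int next_slot;
};

static struct area areas[MAX_AREAS];
static struct { int head; int next; } area_list = { -1, -1 };

struct slot_id { int area; unsigned int offset; };   /* offset 0 means failure */

/* Stand-in for scan_swap_map( ): hands out slots 1, 2, 3, ... */
static unsigned int take_slot(struct area *p)
{
    if (p->free_slots == 0)
        return 0;
    p->free_slots--;
    return ++p->next_slot;
}

struct slot_id get_swap_page_sketch(void)
{
    struct slot_id entry = { -1, 0 };
    int type = area_list.next;          /* Step 2 */
    int wrapped = 0;                    /* 0: first (partial) pass, 1: second pass */

    if (type < 0)                       /* Step 1: no active swap areas */
        return entry;

    for (;;) {
        struct area *p = &areas[type];

        if (p->writable) {              /* Step 3 */
            unsigned int offset = take_slot(p);
            if (offset != 0) {
                int nxt = p->next;
                entry.area = type;
                entry.offset = offset;
                /* Continue round-robin among equal priorities, else rewind. */
                if (nxt >= 0 && areas[nxt].prio == p->prio)
                    area_list.next = nxt;
                else
                    area_list.next = area_list.head;
                return entry;
            }
        }

        type = p->next;                 /* Step 4 */
        if (!wrapped) {
            if (type < 0 || areas[type].prio != p->prio) {
                type = area_list.head;  /* Step 5a: start the second pass */
                wrapped = 1;
            }
        } else if (type < 0) {
            return entry;               /* Step 6: every area is full */
        }
        /* Step 5b: otherwise keep going with the next element. */
    }
}
```

With two areas of priority 5 and one of priority 1, each holding a single free slot, the first two calls alternate between the high-priority areas, the third call falls through to the low-priority area during the second pass, and the fourth call fails.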

16.2.4.3 The swap_free( ) function

The swap_free( ) function is invoked when swapping in a page to decrement the corresponding swap_map counter (see Table 16-1). When the counter reaches 0, the page slot becomes free since its identifier is no longer included in any Page Table entry. We'll see in Section 16.3, however, that the swap cache counts as an owner of the page slot.

The function acts on a single entry parameter that specifies a swapped-out page identifier and performs the following steps:

1. Derives the swap area index and the offset page slot index from the entry parameter and gets the address of the swap area descriptor.

2. Checks whether the swap area is active and returns right away if it is not.

3. If the swap_map counter corresponding to the page slot being freed is smaller than SWAP_MAP_MAX, decrements it. Recall that entries that have the SWAP_MAP_MAX value are considered persistent (undeletable).

4. If the swap_map counter becomes 0, increments the value of nr_swap_pages and updates, if necessary, the lowest_bit and highest_bit fields of the swap area descriptor.
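Steps 2 through 4 amount to a few lines of bookkeeping. In the following sketch, which is a simplified model rather than the kernel's implementation, struct swap_area_f and free_slots_total are hypothetical names (the latter plays the role of nr_swap_pages), the value chosen for SWAP_MAP_MAX is an assumption, the decoding of the entry parameter (Step 1) is left out, and the guard against freeing an already free slot is our own addition.

```c
#include <assert.h>

#define SWAP_MAP_MAX 32767      /* assumed value; see Table 16-1 for the real one */

/* Hypothetical, reduced swap area descriptor. */
struct swap_area_f {
    int active;
    unsigned short *swap_map;
    unsigned int lowest_bit;
    unsigned int highest_bit;
};

static long free_slots_total;   /* plays the role of nr_swap_pages */

void swap_free_sketch(struct swap_area_f *si, unsigned int offset)
{
    unsigned short count;

    assert(si->swap_map != 0);

    if (!si->active)                    /* Step 2 */
        return;

    count = si->swap_map[offset];
    if (count == 0)                     /* added guard: slot is already free */
        return;
    if (count >= SWAP_MAP_MAX)          /* persistent entry: never decremented */
        return;

    si->swap_map[offset] = --count;     /* Step 3 */

    if (count == 0) {                   /* Step 4: the slot just became free */
        free_slots_total++;
        if (offset < si->lowest_bit)
            si->lowest_bit = offset;
        if (offset > si->highest_bit)
            si->highest_bit = offset;
    }
}
```

Note how freeing a slot can only widen the [lowest_bit, highest_bit] window, while allocating a slot can only narrow it.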


16.3 The Swap Cache

The swap cache is crucial to avoid race conditions among processes trying to access pages that are being swapped. If a page is owned by a single process (or better, if the page belongs to an address space that is owned by one or more clone processes), there is just one race condition to be considered: the process attempts to address a page that is being swapped out. An array of semaphores, one per page slot, could be used to block the process until the I/O data transfer completes.

In many cases, however, a page is owned by several processes. Again, the same array of semaphores could suffice to avoid race conditions, provided that the kernel is able to locate quickly all Page Table entries that refer to the page to be swapped out. Therefore, the kernel could ensure that either all processes see the same page frame or all of them see the swapped-out page identifier. Unfortunately, there is no quick way in Linux 2.4 to derive from the page frame the list of processes that own it. [4] Scanning all Page Table entries of all processes looking for an entry with a given physical address is very costly, and it is done only on rare occasions (for instance, when deactivating a swap area).

[4] One of the hot features of Linux 2.5 consists of a data structure that allows the kernel to quickly get a list of all processes that share a given page.

As a result, the same page may be swapped out for some processes and present in memory for others. The kernel avoids the race conditions induced by this peculiar scenario by means of the swap cache. Before describing how the swap cache works, let's recall when a page frame may be shared among several processes:

● The page frame is associated with a shared nonanonymous memory mapping (see Chapter 15).
● The page frame is handled by means of Copy On Write, typically because a new process has been forked or because the page frame belongs to a private memory mapping (see Section 8.4.4).
● The page frame is allocated to an IPC shared memory resource (see Section 19.3.5) or to a shared anonymous memory mapping.

Of course, a page frame is also shared by several processes if they share the memory descriptor and thus the whole set of Page Tables. Recall that such processes are created by passing the CLONE_VM flag to the clone( ) system call (see Section 3.4.1). All clone processes, however, count as a single process as far as the swapping algorithm is concerned. Therefore, here we use the term "processes" to mean "processes owning different memory descriptors."

As we shall see later in this chapter, page frames used for shared nonanonymous memory mappings are never swapped out. Instead, they are handled by another kernel function that writes their data to the proper files and discards them. However, the other two kinds of shared page frames must be carefully handled by the swapping algorithm by means of the swap cache.

The swap cache collects shared page frames that have been copied to swap areas. It does not exist as a data structure on its own; instead, the pages in the regular page cache are considered to be in the swap cache if certain fields are set.

Shared page swapping works in the following manner: consider a page P that is shared among two processes, A and B. Suppose that the swapping algorithm scans the page frames of process A and selects P for swapping out: it allocates a new page slot and copies the data stored in P into the new page slot. It then puts the swapped-out page identifier in the corresponding Page Table entry of process A. Finally, it invokes __free_page( ) to release the page frame. However, the page's usage counter does not become 0 since P is still owned by B. Thus, the swapping algorithm succeeds in transferring the page into the swap area, but fails to reclaim the corresponding page frame.

Suppose now that the swapping algorithm scans the page frames of process B at a later time and selects P for swapping out. The kernel must recognize that P has already been transferred into a swap area so the page won't be swapped out a second time. Moreover, it must be able to derive the swapped-out page identifier so it can increase the page slot usage counter.

Figure 16-3 illustrates schematically the actions performed by the kernel on a shared page that is swapped out from multiple processes at different times. The numbers inside the swap area and inside P represent the page slot usage counter and the page usage counter, respectively. Notice that each usage count includes every process that is using the page or page slot, plus the swap cache if the page is included in it. Four stages are shown:

1. In (a), P is present in the Page Tables of both A and B.
2. In (b), P has been swapped out from A's address space.
3. In (c), P has been swapped out from both the address spaces of A and B, but is still included in the swap cache.
4. Finally, in (d), P has been released to the buddy system.

Figure 16-3. The role of the swap cache

The swap cache is implemented by the page cache data structures and procedures, which are described in Section 14.1. Recall that the core of the page cache is a hash table that allows the algorithm to quickly derive the address of a page descriptor from the address of an address_space object identifying the owner of the page as well as from an offset value.

Pages in the swap cache are stored like any other page in the page cache, with the following special treatment:

● The mapping field of the page descriptor points to an address_space object stored in the swapper_space variable.
● The index field stores the swapped-out page identifier associated with the page.

Moreover, when the page is put in the swap cache, both the count field of the page descriptor and the page slot usage counters are incremented, since the swap cache uses both the page frame and the page slot.
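The counting rule (one unit per process using the page or the slot, plus one for the swap cache) can be checked with a toy model of the page P shared by processes A and B. Everything in this sketch is hypothetical: the structure and function names are ours, and the counter values are derived from the rule stated in the text rather than taken from kernel code.

```c
#include <assert.h>

/* Toy model of one shared page and the page slot that backs it. */
struct shared_page_model {
    int page_count;      /* usage counter of the page descriptor */
    int slot_count;      /* swap_map counter of the page slot */
    int in_swap_cache;
};

/* One process has the page swapped out of its address space. */
void model_swap_out_from_process(struct shared_page_model *m)
{
    if (!m->in_swap_cache) {
        /* First swap-out: the swap cache becomes a user of both
         * the page frame and the page slot. */
        m->in_swap_cache = 1;
        m->page_count++;
        m->slot_count++;
    }
    m->slot_count++;     /* the Page Table entry now holds the slot identifier */
    m->page_count--;     /* the process drops its reference to the page frame */
}

/* The page leaves the swap cache; the frame can be freed when the count hits 0. */
void model_drop_from_swap_cache(struct shared_page_model *m)
{
    assert(m->in_swap_cache);
    m->in_swap_cache = 0;
    m->page_count--;
    m->slot_count--;
}
```

Starting from a page count of 2 and a slot count of 0, swapping P out from A gives (2, 2), swapping it out from B gives (1, 3), and dropping it from the swap cache gives (0, 2): the page frame is free, and the slot is still referenced by the two Page Table entries.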

16.3.1 Swap Cache Helper Functions

The kernel uses several functions to handle the swap cache; they are based mainly on those discussed in Section 14.1. We show later how these relatively low-level functions are invoked by higher-level functions to swap pages in and out as needed. The main functions that handle the swap cache are:

lookup_swap_cache( )

Finds a page in the swap cache through its swapped-out page identifier passed as a parameter and returns the page address. It returns 0 if the page is not present in the cache. It invokes find_get_page( ), passing as parameters the address of the swapper_space page address space object and the swapped-out page identifier to find the required page.

add_to_swap_cache( )

Inserts a page into the swap cache. It essentially invokes swap_duplicate( ) to check whether the page slot passed as a parameter is valid and to increment the page slot usage counter; find_get_page( ) to make sure that no other page with the same address_space object and offset already exists; add_to_page_cache( ) to insert the page into the cache; and lru_cache_add( ) to insert the page in the inactive list (see Section 16.7.2).

delete_from_swap_cache( )

Removes a page from the swap cache by flushing its content to disk, clearing the PG_dirty flag, and invoking remove_page_from_inode_queue( ) and remove_page_from_hash_queue( ) (see Section 14.1.2).

free_page_and_swap_cache( )

Releases a page by invoking __free_page( ). If the caller is the only process that owns the page, this function also removes the page from the active or inactive list (see Section 16.7.2), removes the page from the swap cache by invoking delete_from_swap_cache( ), and frees the page slot on the swap area by flushing the page contents to disk and invoking swap_free( ).


16.4 Transferring Swap Pages

Transferring swap pages wouldn't be so complicated if there weren't so many race conditions and other potential hazards to guard against. Here are some of the things that have to be checked regularly:

● The process that owns a page may terminate while the page is being swapped in or out.
● Another process may be in the middle of swapping in a page that the current one is trying to swap out (or vice versa).

Like any other disk access type, I/O data transfers for swap pages are blocking operations. Therefore, the kernel must take care to avoid simultaneous transfers involving the same page frame, the same page slot, or both.

Race conditions can be avoided on the page frame through the mechanisms discussed in Chapter 13. Specifically, before starting an I/O operation on the page frame, the kernel waits until its PG_locked flag is off. When the function returns, the page frame lock has been acquired, and therefore no other kernel control path can access the page frame's contents during the I/O operation.

But the state of the page slot must also be tracked. The PG_locked flag of the page descriptor is used once again to ensure exclusive access to the page slot involved in the I/O data transfer. Before starting an I/O operation on a swap page, the kernel checks that the page frame involved is included in the swap cache; if not, it adds the page frame into the swap cache. Let's suppose some process tries to swap in a page while the same page is currently being transferred. Before doing any work related to the swap in, the kernel looks in the swap cache for a page frame associated with the given swapped-out page identifier. Since the page frame is found, the kernel knows that it must not allocate a new page frame, but must simply use the cached page frame. Moreover, since the PG_locked flag is set, the kernel suspends the kernel control path until the bit becomes 0, so that both the page frame's contents and the page slot in the swap area are preserved until the I/O operation terminates.

In short, thanks to the swap cache, the PG_locked flag of the page frame also acts as a lock for the page slot in the swap area.

16.4.1 The rw_swap_page( ) Function

The rw_swap_page( ) function is used to swap in or swap out a page. It receives the following parameters:

rw

A flag specifying the direction of data transfer: READ for swapping in, WRITE for swapping out.

page

The address of a descriptor of a page in the swap cache.

Before invoking the function, the caller must ensure that the page is included in the swap cache and lock the page to prevent race conditions due to concurrent accesses to the page frame or to the page slot in the swap area, as described in the previous section. To be on the safe side, the rw_swap_page( ) function checks that these two conditions effectively hold, and then gets the swapped-out page identifier from page->index and invokes the rw_swap_page_base( ) function, passing to it the page identifier, the page descriptor address page, and the direction flag rw.

The rw_swap_page_base( ) function is the core of the swapping algorithm; it performs the following steps:

1. If the data transfer is for a swap-in operation (rw set to READ), it clears the PG_uptodate flag of the page frame. The flag is set again only if the swap-in operation terminates successfully.

2. Gets the proper swap area descriptor and the slot index from the swapped-out page identifier.

3. If the swap area is a disk partition, gets the corresponding block device number from the swap_device field of the swap area descriptor. In this case, the slot index also represents the logical block number of the requested data because the block size of any swap disk partition is always equal to the page size (PAGE_SIZE).

4. Otherwise, if the swap area is a regular file, it executes the following substeps:
   a. Gets the number of the block device that stores the file from the i_dev field of its inode object (the swap_files->d_inode field in the swap area descriptor).
   b. Gets the block size of the device (the i_sb->s_blocksize field of the inode).
   c. Computes the file block number corresponding to the given slot index.
   d. Fills a local array with the logical block numbers of the blocks in the page slot; every logical block number is obtained by invoking the bmap method of the address_space object whose address is stored in the i_mapping field of the inode. If the bmap method fails, rw_swap_page_base( ) returns 0 (failure).

5. Invokes the brw_page( ) function to start a page I/O operation on the block (or blocks) identified in the previous steps and returns 1 (success).

Since the page I/O operation activated by brw_page( ) is asynchronous, the rw_swap_page( ) function might terminate before the actual I/O data transfer completes. However, as described in Section 13.4.8.2, the kernel eventually executes the end_buffer_io_async( ) function (which verifies that all data transfers successfully completed), unlocks the page, and sets its PG_uptodate flag.
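Substeps 4c and 4d for a swap area stored in a regular file boil down to a little arithmetic: a page slot covers PAGE_SIZE / block_size consecutive file blocks, and each of them must be translated into a logical block number. The sketch below illustrates this under our own assumptions: a 4,096-byte page, a hypothetical bmap_fn callback in place of the bmap method, and a made-up demo_bmap( ) mapping used only to exercise the code.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE  4096u    /* assumed page size */
#define SKETCH_MAX_BLOCKS 8u       /* 4096 / 512, the smallest block size accepted */

/* Hypothetical stand-in for the bmap method: file block -> logical block,
 * returning 0 on failure (for instance, a hole in the file). */
typedef unsigned long (*bmap_fn)(unsigned long file_block);

/* Fills zones[] with the logical block numbers backing the given page slot.
 * Returns the number of blocks filled, or 0 if any translation fails. */
size_t slot_to_blocks(unsigned long slot, unsigned int block_size,
                      bmap_fn bmap, unsigned long zones[SKETCH_MAX_BLOCKS])
{
    size_t per_page, i;
    unsigned long file_block;

    assert(block_size >= 512 && block_size <= SKETCH_PAGE_SIZE);

    per_page = SKETCH_PAGE_SIZE / block_size;
    file_block = slot * per_page;           /* Substep c */

    for (i = 0; i < per_page; i++) {        /* Substep d */
        zones[i] = bmap(file_block + i);
        if (zones[i] == 0)
            return 0;                       /* bad swap file */
    }
    return per_page;
}

/* Made-up mapping for demonstration: a 16-block file stored
 * contiguously starting at logical block 1000. */
unsigned long demo_bmap(unsigned long file_block)
{
    return file_block < 16 ? 1000 + file_block : 0;
}
```

With 1,024-byte blocks, slot 1 maps to file blocks 4 through 7. With a block size equal to the page size, a slot maps to exactly one block, which is the disk partition case of Step 3.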

16.4.2 The read_swap_cache_async( ) Function

The read_swap_cache_async( ) function, which receives as a parameter a swapped-out page identifier, is invoked whenever the kernel must swap in a page. As we know, before accessing the swap partition, the function must check whether the swap cache already includes the desired page frame. Therefore, the function essentially executes the following operations:

1. Invokes find_get_page( ) to search for the page in the swap cache. If the page is found, it returns the address of its descriptor.

2. The page is not included in the swap cache. Invokes alloc_page( ) to allocate a new page frame. If no free page frame is available, it returns 0 (indicating the system is out of memory).

3. Invokes add_to_swap_cache( ) to insert the new page frame into the swap cache. As mentioned in Section 16.3.1, this function also locks the page.

4. The previous step might fail if add_to_swap_cache( ) finds a duplicate of the page in the swap cache. For instance, the process could block in Step 2, thus allowing another process to start a swap-in operation on the same page slot. In this case, the function releases the page frame allocated in Step 2 and restarts from Step 1.

5. Otherwise, the new page frame is inserted into the swap cache. Invokes rw_swap_page( ) to read the page's contents from the swap area, passing the READ parameter and the page descriptor to that function.

6. Returns the address of the page descriptor.
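The lookup, allocate, insert, and retry pattern of these steps can be modeled in a single-threaded sketch. All names below are hypothetical: toy_cache is a plain array standing in for the swap cache, toy_add( ) for add_to_swap_cache( ), toy_read( ) for rw_swap_page( ) plus I/O completion, and toy_alloc_hook is a test hook that simulates another kernel control path running while the allocation blocks.

```c
#include <assert.h>
#include <stdlib.h>

#define NSLOTS 8

struct toy_page { unsigned long index; int locked; int uptodate; };

static struct toy_page *toy_cache[NSLOTS];      /* toy swap cache: slot -> page */
static void (*toy_alloc_hook)(unsigned long);   /* simulates a racing control path */

/* Stand-in for add_to_swap_cache( ): fails on a duplicate, locks on success. */
static int toy_add(struct toy_page *pg, unsigned long slot)
{
    if (toy_cache[slot] != NULL)
        return -1;
    pg->index = slot;
    pg->locked = 1;
    toy_cache[slot] = pg;
    return 0;
}

/* Stand-in for rw_swap_page(READ, page) followed by I/O completion. */
static void toy_read(struct toy_page *pg)
{
    pg->uptodate = 1;
    pg->locked = 0;
}

/* A racer that inserts its own page for the slot, as another process might. */
static void toy_racer(unsigned long slot)
{
    static struct toy_page other;
    if (toy_cache[slot] == NULL)
        toy_add(&other, slot);
}

struct toy_page *read_swap_cache_sketch(unsigned long slot)
{
    struct toy_page *fresh;

    assert(slot < NSLOTS);
    for (;;) {
        if (toy_cache[slot] != NULL)        /* Step 1: already cached */
            return toy_cache[slot];

        fresh = calloc(1, sizeof *fresh);   /* Step 2: may block in the kernel */
        if (fresh == NULL)
            return NULL;
        if (toy_alloc_hook != NULL)
            toy_alloc_hook(slot);

        if (toy_add(fresh, slot) == 0) {    /* Step 3 */
            toy_read(fresh);                /* Step 5 */
            return fresh;                   /* Step 6 */
        }
        free(fresh);                        /* Step 4: lost the race, retry */
    }
}
```

When the racer wins, the function discards its own page frame and returns the page inserted by the other path, which is still locked and not yet up to date; a real caller would then wait for the lock to be released.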

16.4.3 The rw_swap_page_nolock( ) Function

There is just one case in which the kernel wants to read a page from a swap area without putting it in the swap cache. This happens when servicing the swapon( ) system call: the kernel reads the first page of a swap area, which contains the swap_header union, and then immediately discards the page frame. Since the kernel is activating the swap area, no process can swap in or swap out a page on it, so there is no need to protect the access to the page slot.

The rw_swap_page_nolock( ) function receives as parameters the type of I/O operation (READ or WRITE), a swapped-out page identifier, and the address of a page frame (already locked). It performs the following operations:

1. Gets the page descriptor of the page frame passed as a parameter.

2. Initializes the mapping field of the page descriptor with the address of the swapper_space object; this is done because the sync_page method is executed in Step 4.

3. Invokes rw_swap_page_base( ) to start the I/O swap operation.

4. Waits until the I/O data transfer completes by invoking wait_on_page( ).

5. Unlocks the page.

6. Sets the mapping field of the page descriptor to NULL and returns.


16.5 Swapping Out Pages

Section 16.7 explains what happens when pages are swapped out. As we indicated at the beginning of this chapter, swapping out pages is a last resort and appears as part of a general strategy to free memory that uses other tactics as well. In this section, we show how the kernel performs a swap out. This is achieved by a series of functions called in cascading fashion. Let's start with the functions at the higher level.

The swap_out( ) function acts on a single classzone parameter that specifies the memory zone from which pages should be swapped out (see Section 7.1.2). Two other parameters, priority and gfp_mask, are not used.

The swap_out( ) function scans existing memory descriptors and tries to swap out the pages referenced in each process's Page Tables. It terminates as soon as one of the following conditions occurs:

● The function succeeds in releasing SWAP_CLUSTER_MAX page frames (by default, 32). A page frame is considered released when it is removed from the Page Tables of all processes that share it.
● The function scans n memory descriptors, where n is the length of the memory descriptor list when the function starts. [5]

[5] The swap_out( ) function can block, so memory descriptors might appear and disappear on the list during a single invocation of the function.

To ensure that all processes are evenly penalized by swap_out( ), the function starts scanning the list from the memory descriptor that was last analyzed in the previous invocation; the address of this memory descriptor is stored in the swap_mm global variable.

For each memory descriptor mm to be considered, the swap_out( ) function increments the usage counter mm->mm_users, thus ensuring that the memory descriptor cannot disappear from the list while the swapping algorithm is working on it. Then, swap_out( ) invokes the swap_out_mm( ) function, passing to it the memory descriptor address mm, the memory zone classzone, and the number of page frames still to be released. Once swap_out_mm( ) returns, swap_out( ) decrements the usage counter mm->mm_users, and then decides whether it should analyze the next memory descriptor in the list or just terminate.

swap_out_mm( ) returns the number of pages of the process that owns the memory descriptor that the function has released. The swap_out( ) function uses this value to update a counter of how many pages have been released since the beginning of its execution; if the counter reaches the value SWAP_CLUSTER_MAX, swap_out( ) terminates.
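The outer loop of swap_out( ) can be summarized with a small model. This is a sketch under simplifying assumptions, not the kernel's code: the memory descriptor list is a fixed array, toy_swap_mm is an index playing the role of the swap_mm variable, toy_swap_out_mm( ) simply pretends that every remaining page of a process can be released, and the classzone parameter and all locking are left out.

```c
#include <assert.h>

#define SWAP_CLUSTER_MAX 32

/* Hypothetical, reduced memory descriptor. */
struct toy_mm {
    int mm_users;           /* usage counter */
    int swappable_pages;    /* pages this toy process can still give up */
};

static int toy_swap_mm;     /* where the next scan starts; stands in for swap_mm */

/* Stand-in for swap_out_mm( ): releases up to `wanted` pages. */
static int toy_swap_out_mm(struct toy_mm *mm, int wanted)
{
    int n = mm->swappable_pages < wanted ? mm->swappable_pages : wanted;
    mm->swappable_pages -= n;
    return n;
}

/* Returns the number of page frames released by this invocation. */
int swap_out_sketch(struct toy_mm *mms, int nr_mms)
{
    int released = 0;
    int scanned;

    assert(nr_mms == 0 || (toy_swap_mm >= 0 && toy_swap_mm < nr_mms));

    for (scanned = 0;
         scanned < nr_mms && released < SWAP_CLUSTER_MAX;
         scanned++) {
        struct toy_mm *mm = &mms[toy_swap_mm];

        mm->mm_users++;     /* pin the descriptor while working on it */
        released += toy_swap_out_mm(mm, SWAP_CLUSTER_MAX - released);
        mm->mm_users--;

        /* Move on only if more pages are needed; otherwise the next
         * invocation resumes from the descriptor analyzed last. */
        if (released < SWAP_CLUSTER_MAX)
            toy_swap_mm = (toy_swap_mm + 1) % nr_mms;
    }
    return released;
}
```

With three processes holding 10, 40, and 5 swappable pages, the first invocation releases 32 pages and stops inside the second process; the second invocation resumes there and, after scanning all three descriptors, releases the remaining 23 pages.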

The swap_out_mm( ) function scans the memory regions of the process that owns the memory descriptor mm passed as a parameter. Usually, the function starts analyzing the first memory region object in the mm->mmap list (remember that they are ordered by starting linear addresses). However, if mm is the memory descriptor that was analyzed last in the previous invocation of swap_out( ), swap_out_mm( ) does not restart from the first memory region, but from the memory region that includes the linear address last analyzed in the previous invocation. This linear address is stored in the swap_address field of the memory descriptor; if all memory regions of the process have been analyzed, then the field stores the conventional value TASK_SIZE.

For each memory region of the process that owns the memory descriptor mm, swap_out_mm( ) invokes the swap_out_vma( ) function, passing to it the number of pages yet to be released, the first linear address to analyze, the memory region object, and the memory descriptor. Again, swap_out_vma( ) returns the number of released pages belonging to the memory region. The loop of swap_out_mm( ) continues until either the requested number of pages is released or all memory regions are considered.

The swap_out_vma( ) function checks that the memory region is swappable (e.g., the flag VM_RESERVED is cleared). It then starts a sequence in which it considers all entries in the process's Page Global Directory that refer to linear addresses in the memory region. For each such entry, the function invokes the swap_out_pgd( ) function, which in turn considers all entries in a Page Middle Directory corresponding to address intervals in the memory region. For each such entry, swap_out_pgd( ) invokes the swap_out_pmd( ) function, which considers all entries in a Page Table referencing pages in the memory region. Also, swap_out_pmd( ) invokes the try_to_swap_out( ) function, which finally attempts to swap out the page. As usual, this chain of function invocations breaks as soon as the requested number of released page frames is reached.
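The nested walk and its early termination can be sketched in user space as follows. All toy_* names, the four-entry tables, and the "nonzero integer means mapped page" convention are assumptions made for illustration; the sketch also has each inner level return the number of pages still wanted, which is just one convenient way to propagate the budget.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_ENTRIES 4   /* toy table size, chosen only for illustration */

struct toy_pt  { int pte[TOY_ENTRIES]; };              /* nonzero = mapped page */
struct toy_pmd { struct toy_pt  *pt[TOY_ENTRIES]; };
struct toy_pgd { struct toy_pmd *pmd[TOY_ENTRIES]; };

/* Stand-in for try_to_swap_out(): "releases" any mapped page. */
static int toy_try_to_swap_out(int *pte)
{
    if (*pte == 0)
        return 0;
    *pte = 0;
    return 1;
}

/* Scans one Page Table; returns how many pages are still wanted. */
static int toy_swap_out_pmd(struct toy_pt *pt, int count)
{
    for (int i = 0; i < TOY_ENTRIES && count > 0; i++)
        count -= toy_try_to_swap_out(&pt->pte[i]);
    return count;
}

/* Scans one Page Middle Directory; returns how many pages are still wanted. */
static int toy_swap_out_pgd(struct toy_pmd *pmd, int count)
{
    for (int i = 0; i < TOY_ENTRIES && count > 0; i++)
        if (pmd->pt[i] != NULL)
            count = toy_swap_out_pmd(pmd->pt[i], count);
    return count;
}

/* Scans the Page Global Directory; returns the number of pages released. */
static int toy_swap_out_vma(struct toy_pgd *pgd, int count)
{
    int wanted = count;

    assert(pgd != NULL && count >= 0);
    for (int i = 0; i < TOY_ENTRIES && wanted > 0; i++)
        if (pgd->pmd[i] != NULL)
            wanted = toy_swap_out_pgd(pgd->pmd[i], wanted);
    return count - wanted;
}
```

Every loop tests the remaining budget, so the whole chain unwinds as soon as enough pages have been released.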

16.5.1 The try_to_swap_out( ) Function

The try_to_swap_out( ) function attempts to free a given page frame, either discarding or swapping out its contents. The function returns the value 1 if it succeeds in releasing the page, and 0 otherwise. Remember that by "releasing the page," we mean that the references to the page frame are removed from the Page Tables of all processes that share the page. In this case, however, the page frame is not necessarily released to the buddy system; for instance, it could be referenced by the swap cache. The parameters of the function are:

mm
    Memory descriptor address

vma
    Memory region object address

address
    Initial linear address of the page

page_table
    Address of the Page Table entry that maps address

page
    Page descriptor address

classzone
    The memory zone from which pages should be swapped out

The try_to_swap_out( ) function uses the Accessed and Dirty flags included in the Page Table entry. We stated in Section 2.4.1 that the Accessed flag is automatically set by the CPU's paging unit at every read or write access, while the Dirty flag is automatically set at every write access. These two flags offer a limited degree of hardware support that allows the kernel to use a primitive LRU replacement algorithm.

try_to_swap_out( ) must recognize many different situations demanding different responses, but the responses all share many of the same basic operations. In particular, the function performs the following steps:

1. Checks the Accessed flag of the page_table entry. If it is set, the page must be considered "young"; in this case, the function clears the flag, invokes mark_page_accessed( ) (see Section 16.7.2 later in this chapter), and returns 0. This check ensures that a page can be swapped out only if it was not accessed since the previous invocation of try_to_swap_out( ) on it.

2. If the memory region is locked (VM_LOCKED flag set), invokes mark_page_accessed( ) on it, and returns 0.

3. If the PG_active flag in the page->flags field is set, the page is considered actively used and shouldn't be swapped out; the function returns 0.

4. If the page does not belong to the memory zone specified by the classzone parameter, returns 0.

5. Tries to lock the page; if it is already locked (PG_locked flag set), it is not possible to swap out the page because it is involved in an I/O data transfer; the function returns 0.

6. At this point, the function knows that the page can be swapped out. Forces the value zero into the Page Table entry addressed by page_table and invokes flush_tlb_page( ) to invalidate the corresponding TLB entries.

7. If the Dirty flag in the Page Table entry was set, invokes the set_page_dirty( ) function to set the PG_dirty flag in the page descriptor. Moreover, this function moves the page into the dirty_pages list of the address_space object referenced by page->mapping, if any, and marks the inode page->mapping->host as dirty (see Section 14.1.2.2).

8. If the page belongs to the swap cache, it performs the following substeps:

   a. Gets the swapped-out page identifier from page->index.

   b. Invokes swap_duplicate( ) to verify whether the page slot index is valid and to increment the corresponding usage counter in swap_map.

   c. Stores the swapped-out page identifier in the Page Table entry addressed by page_table.

   d. Decrements the rss field of the memory descriptor mm.

   e. Unlocks the page.

   f. Decrements the page usage counter page->count.

   g. If the page is no longer referenced by any process, it returns 1; otherwise, it returns 0.[6]

   [6] The check is easily done by looking at the value of the page->count usage counter. Of course, the function must consider that the counter is incremented when the page is inserted into the swap cache (or the page cache), and when there are buffers allocated on the page (i.e., when the page->buffers field is not null).

   Notice that the function does not have to allocate a new page slot, because the page frame has already been swapped out when scanning the Page Tables of some other process.

9. The page is not inserted into the swap cache. Checks whether the page belongs to an address_space object (the page->mapping field is not null); in this case, the page belongs to a shared file memory mapping, so the function jumps to Step 8d to release the page frame, leaving the corresponding Page Table entry null. Notice that the page frame reference of the process is released even if the page is not saved into a swap area. This is because the page has an image on disk, and the function has already triggered, if necessary, the update of this image in Step 7. Moreover, notice also that the page frame is not released to the buddy system because the page is still owned by the page cache (see Section 14.1.2.3).

10. If the function reaches this point, the page is not inserted into the swap cache, and it does not belong to an address_space object. The function checks the status of the PG_dirty flag; if it is cleared, the function jumps to Step 8d to release the page frame, leaving the corresponding Page Table entry null. There is no need to save the page contents on a swap area because the process never wrote into the page frame. The kernel recognizes this case because the PG_dirty flag is cleared, and this flag is never reset if the page has no image on disk or if it belongs to a private memory mapping. When the process accesses the same page again, the kernel handles the Page Fault through the demand paging technique (see Section 8.4.3); then the new page frame is filled with exactly the same data as that stored in this released page frame.

11. If the function reaches this point, the page is not inserted into the swap cache, it does not have an image on disk, and it is dirty; here the function checks whether the page contains buffers (it is a buffer page, so its page->buffers field is not null). In this case, the function restores the original contents of the Page Table entry, unlocks the page, and returns 0. How could the page host some buffers if the page doesn't belong to an address_space object, that is, if it has no image on disk? Actually, this might occur in rare circumstances, for instance, if the page maps a portion of a file that has just been truncated. In these cases, try_to_swap_out( ) does nothing.

12. At this point, the page is not inserted into the swap cache, it does not have an image on disk, and it is dirty; the function must definitely swap it out in a new page slot. It invokes the get_swap_page( ) function to allocate a free page slot in an active swap area. If there are none, it restores the original content of the Page Table entry, unlocks the page, and returns 0.

13. Invokes add_to_swap_cache( ) to insert the page in the swap cache. The function might fail if another kernel control path is trying to swap in the page. As we shall see in the next section, this can happen even if the page slot is not referenced by any process. In this case, it invokes swap_free( ) to release the page slot and restarts from Step 12.

14. Sets the PG_uptodate flag of the page.

15. Invokes the set_page_dirty( ) function again (see Step 7 above) because add_to_swap_cache( ) resets the PG_dirty flag.

16. Jumps to Step 8c to store the swapped-out page identifier in the Page Table entry and to release the page frame.
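Once a page has passed the preliminary checks and its Page Table entry has been cleared, the remaining work is essentially a case analysis. The sketch below condenses that analysis into a pure classification function. The structure, the enumeration, and all names are hypothetical; the real function acts on page descriptors and performs the side effects described in the steps rather than returning a label.

```c
#include <assert.h>
#include <stddef.h>

/* Outcome of the case analysis; labels are illustrative only. */
enum toy_action {
    TOY_REUSE_SLOT,     /* already in the swap cache: store its identifier */
    TOY_DROP_MAPPED,    /* owned by an address_space: image exists on disk */
    TOY_DROP_CLEAN,     /* never written: demand paging can recreate it    */
    TOY_GIVE_UP,        /* hosts buffers but has no mapping: restore entry */
    TOY_ALLOCATE_SLOT   /* dirty anonymous page: needs a new page slot     */
};

/* Hypothetical summary of the page state the function inspects. */
struct toy_page_state {
    int in_swap_cache;  /* page belongs to the swap cache    */
    int has_mapping;    /* page->mapping is not null         */
    int dirty;          /* PG_dirty flag is set              */
    int has_buffers;    /* page->buffers is not null         */
};

/* The order of the tests follows the order of the cases in the text. */
static enum toy_action toy_classify(const struct toy_page_state *p)
{
    assert(p != NULL);
    if (p->in_swap_cache)
        return TOY_REUSE_SLOT;
    if (p->has_mapping)
        return TOY_DROP_MAPPED;
    if (!p->dirty)
        return TOY_DROP_CLEAN;
    if (p->has_buffers)
        return TOY_GIVE_UP;
    return TOY_ALLOCATE_SLOT;
}
```

Only the last outcome costs a new page slot; every other outcome either reuses existing disk state or leaves the page in place.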



The try_to_swap_out( ) function does not directly invoke rw_swap_page( ) to trigger the activation of the I/O data transfer. Rather, the function limits itself to inserting the page in the swap cache, if necessary, and to marking the page as dirty. However, we'll see in the later section Section 16.7.4 that the kernel periodically flushes the disk caches to disk by invoking the writepage methods of the address_space objects that own the dirty pages.

As mentioned in the earlier section Section 16.3, the address_space object of the pages that belong to the swap cache is a special object stored in swapper_space. Its writepage method is implemented by the swap_writepage( ) function, which executes the following steps:

1. Checks whether the page is not included in the Page Tables of any process; in this case, it removes the page from the swap cache and releases the swap page slot.

2. Otherwise, it invokes rw_swap_page( ) on the page, specifying the WRITE command (see the earlier section Section 16.4.1).
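The two branches of swap_writepage( ) amount to a small optimization: a page that no process maps any longer is not worth writing. A toy model, with hypothetical names and fields that only record what happened, might look like this:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record of a swap cache page and of what was done to it. */
struct toy_swap_page {
    int mappers;          /* number of Page Tables still referencing the page */
    int in_swap_cache;    /* 1 while the page sits in the swap cache          */
    int slot_in_use;      /* 1 while its page slot is allocated               */
    int write_started;    /* 1 once a WRITE transfer has been requested       */
};

/* Returns 1 if a write was started, 0 if the page was simply dropped. */
static int toy_swap_writepage(struct toy_swap_page *p)
{
    assert(p != NULL && p->in_swap_cache);
    if (p->mappers == 0) {
        p->in_swap_cache = 0;   /* nobody needs the contents any more */
        p->slot_in_use = 0;     /* so the page slot can be recycled   */
        return 0;
    }
    p->write_started = 1;       /* stands in for rw_swap_page(WRITE, ...) */
    return 1;
}
```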


16.6 Swapping in Pages

Swap-in must take place when a process attempts to address a page within its address space that has been swapped out to disk. The Page Fault exception handler triggers a swap-in operation when the following conditions occur (see Section 8.4.2):

● The page including the address that caused the exception is a valid one, that is, it belongs to a memory region of the current process.

● The page is not present in memory, that is, the Present flag in the Page Table entry is cleared.

● The Page Table entry associated with the page is not null, which means it contains a swapped-out page identifier.
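The three conditions translate directly into a predicate on the Page Table entry. The sketch below assumes the 80x86 layout, in which the Present flag is bit 0; the function name and the in_region argument (standing for "the address belongs to a memory region of the process") are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PTE_PRESENT 0x001u   /* bit 0 of an 80x86 Page Table entry */

/*
 * Returns 1 when a fault must be resolved by swapping the page in:
 * the address is valid, the page is not present, and the entry is
 * non-null (so it holds a swapped-out page identifier).
 */
static int toy_needs_swap_in(int in_region, uint32_t pte)
{
    assert(in_region == 0 || in_region == 1);
    return in_region && !(pte & TOY_PTE_PRESENT) && pte != 0;
}
```

A null entry with the Present flag cleared fails the last test: the page was never accessed, which is the demand paging case rather than a swap-in.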

As described in Section 8.4.3, the handle_pte_fault( ) function, invoked by the do_page_fault( ) exception handler, checks whether the Page Table entry is non-null. If so, it invokes a quite handy do_swap_page( ) function to swap in the page required.

16.6.1 The do_swap_page( ) Function

This do_swap_page( ) function acts on the following parameters:

mm
    Memory descriptor address of the process that caused the Page Fault exception

vma
    Memory region descriptor address of the region that includes address

address
    Linear address that causes the exception

page_table
    Address of the Page Table entry that maps address

orig_pte
    Content of the Page Table entry that maps address

write_access
    Flag denoting whether the attempted access was a read or a write

Contrary to other functions, do_swap_page( ) never returns 0. It returns 1 if the page is already in the swap cache (minor fault), 2 if the page was read from the swap area (major fault), and -1 if an error occurred while performing the swap-in. It essentially executes the following steps:

1. Releases the page_table_lock spin lock of the memory descriptor (it was acquired by the caller function handle_pte_fault( )).

2. Gets the swapped-out page identifier from orig_pte.

3. Invokes lookup_swap_cache( ) to check whether the swap cache already contains a page corresponding to the swapped-out page identifier; if the page is already in the swap cache, it jumps to Step 6.

4. Invokes the swapin_readahead( ) function to read from the swap area a group of at most 2^n pages, including the requested one. The value n is stored in the page_cluster variable, and is usually equal to 3.[7] Each page is read by invoking the read_swap_cache_async( ) function.

   [7] The system administrator may tune this value by writing into the /proc/sys/vm/page-cluster file. Swap-in read-ahead can be disabled by setting page_cluster to 0.

5. Invokes read_swap_cache_async( ) once more to swap in precisely the page accessed by the process that caused the Page Fault. This step might appear redundant, but it isn't really. The swapin_readahead( ) function might fail in reading the requested page, for instance, because page_cluster is set to 0 or the function tried to read a group of pages including a defective page slot (SWAP_MAP_BAD). On the other hand, if swapin_readahead( ) succeeded, this invocation of read_swap_cache_async( ) terminates quickly because it finds the page in the swap cache.

6. If, despite all efforts, the requested page was not added to the swap cache, another kernel control path might have already swapped in the requested page on behalf of a clone of this process. This case is checked by temporarily acquiring the page_table_lock spin lock and comparing the entry to which page_table points with orig_pte. If they differ, the page has already been swapped in by some other kernel thread, so the function returns 1 (minor fault); otherwise, it returns -1 (failure).

7. At this point, we know that the page is in the swap cache. Invokes mark_page_accessed( ) (see the later section Section 16.7.2) and locks the page.

8. Acquires the page_table_lock spin lock.

9. Checks whether another kernel control path has swapped in the requested page on behalf of a clone of this process. In this case, releases the page_table_lock spin lock, unlocks the page, and returns 1 (minor fault).

10. Invokes swap_free( ) to decrement the usage counter of the page slot corresponding to entry.

11. Checks whether the swap areas are at least 50 percent full (nr_swap_pages, the number of free page slots, is smaller than half of total_swap_pages). If so, checks whether the page is owned only by the process that caused the fault (or one of its clones); if this is the case, removes the page from the swap cache.

12. Increments the rss field of the process's memory descriptor.

13. Unlocks the page.

14. Updates the Page Table entry so the process can find the page. The function accomplishes this by writing the physical address of the requested page and the protection bits found in the vm_page_prot field of the memory region into the Page Table entry addressed by page_table. Moreover, if the access that caused the fault was a write and the faulting process is the unique owner of the page, the function also sets the Dirty flag and the Read/Write flag to prevent a useless Copy on Write fault.

15. Releases the mm->page_table_lock spin lock and returns 1 (minor fault) or 2 (major fault).
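Two numeric details in the steps above lend themselves to a quick worked example: the size of the read-ahead group and the "50 percent full" test. The helpers below are hypothetical. In particular, the text only says that the group holds at most 2^n pages including the requested one; that the group starts at a slot index aligned to a multiple of 2^n is an assumption of this sketch.

```c
#include <assert.h>
#include <limits.h>

/* Number of page slots in a read-ahead group for page_cluster == n. */
static unsigned long toy_readahead_count(unsigned int n)
{
    assert(n < sizeof(unsigned long) * CHAR_BIT);
    return 1UL << n;
}

/* First slot of the group containing `offset` (alignment is an assumption). */
static unsigned long toy_readahead_start(unsigned long offset, unsigned int n)
{
    assert(n < sizeof(unsigned long) * CHAR_BIT);
    return (offset >> n) << n;
}

/* 1 if fewer than half of the page slots are still free. */
static int toy_swap_half_full(long nr_swap_pages, long total_swap_pages)
{
    assert(nr_swap_pages >= 0 && nr_swap_pages <= total_swap_pages);
    return nr_swap_pages * 2 < total_swap_pages;
}
```

With the usual value n = 3, a fault on slot 13 would, under the alignment assumption, bring in slots 8 through 15.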


16.7 Reclaiming Page Frame

The virtual memory subsystem of Linux is, without any doubt, the most complex and performance-critical component of the whole kernel. In previous chapters, we explained how the kernel handles dynamic memory by keeping track of free and busy page frames. We have also discussed how every process in User Mode has its own linear address space so that page frames can be assigned to the process at the very last possible moment. Finally, we have also described how dynamic memory is used to cache the data of the slow block devices. In this chapter, we complete our description of the virtual memory subsystem by discussing page frame reclaiming.

As we saw in Chapter 14, the cache systems grab more and more page frames but never release any of them. This is reasonable because cache systems don't know if and when processes will reuse some of the cached data and are therefore unable to identify the portions of cache that should be released. Moreover, thanks to the demand paging mechanism described in Chapter 8, User Mode processes get page frames as long as they proceed with their execution; however, demand paging has no way to force processes to release the page frames whenever they are no longer used. Page frame reclaiming is a remedy for this problem.

The kernel developers' worst nightmare is to encounter a situation in which no free page frame exists. When this happens, the kernel might be easily trapped in a deadly chain of memory requests: to free a page frame, the kernel must write its data to disk. However, to accomplish this operation, the kernel requires another page frame (for instance, to allocate the buffer heads for the I/O data transfer). Since no free page frame exists, no page frame can be freed. In this situation, there is just one solution: kill a victim User Mode process to reclaim the page frames it was using. Of course, even if this solution avoids a system crash, it is not very satisfying for the end users.

The goal of page frame reclaiming is to conserve a minimal pool of free page frames so that the kernel may safely recover from "low on memory" conditions. To do this, it must neither trash the disk caches nor penalize User Mode processes too much; otherwise, system performance will be greatly reduced. As a matter of fact, the hardest job of a developer working on the virtual memory subsystem consists of finding an algorithm that ensures acceptable performance both on desktop machines (on which memory requests are quite limited) and on high-level machines like large database servers (on which memory requests tend to be huge).

Unfortunately, finding a good page frame reclaiming algorithm is a rather empirical job, with very little support from theory. The situation is somewhat similar to evaluating the factors that determine the dynamic priority of a process: the main objective is to tune the parameters that achieve good system performance, without asking too many questions about why it works well. Often, it's just a matter of "let's try this approach and see what happens..." An unpleasant side effect of this empirical approach is that the code changes quickly, even in the even-numbered versions of Linux, which are supposed to be stable. The description that follows refers to Linux 2.4.18.

16.7.1 Outline of the Page Frame Reclaiming Algorithm

Before plunging into details, let's give a brief overview of Linux page frame reclaiming. (Looking too close to the trees' leaves might lead us to miss the whole forest!)

Page frames can be freed in two ways:

● By reclaiming an unused page frame within a cache (either a memory cache or a disk cache)

● By reclaiming a page that belongs to a memory region of a process or to an IPC shared memory region (see Section 19.3.5)

Of course, the algorithm should take into consideration the various different kinds of page frames. For instance, it is preferable to reclaim page frames from a memory cache rather than from a disk cache because the latter pages include precious data obtained by costly accesses to block disk devices. Moreover, the algorithm should keep track of the number of accesses to every page frame. If a page has not been accessed for a long time, the probability that it will be accessed in the near future is low; on the other hand, if a page has been accessed recently, the probability that it will continue to be accessed is high. This is just another application of the locality principle mentioned in Section 2.4.7.

Therefore, the page frame reclaiming algorithm is a blend of several heuristics:

● Careful selection of the order in which caches are examined

● Ordering of pages based on ageing (least recently used pages should be freed before pages accessed recently)

● Distinction of pages based on the page state (for example, nondirty pages are better candidates than dirty pages for swapping out because they don't have to be written to disk)

The main function that triggers page frame reclaiming is try_to_free_pages( ). It is invoked every time the kernel fails in allocating memory. For instance:

● When the grow_buffers( ) function fails to allocate a new buffer page, or the create_buffers( ) function fails to allocate the buffer heads for a buffer page (see Section 14.2.2 and Section 14.2.3). In these cases, the kernel executes free_more_memory( ), which in turn invokes try_to_free_pages( ).

● When the pages_alloc( ) function fails in allocating a group of page frames in a given list of memory zones (see Section 7.1.7). Recall that every memory zone descriptor includes the pages_min watermark, which specifies the number of page frames that should remain free to cope with the "low on memory" emergencies. If no zone in the list has enough free memory to satisfy the request while preserving the minimal pool of free page frames, the kernel invokes the balance_classzone( ) function, which in turn invokes try_to_free_pages( ).

● When the kswapd kernel thread discovers that the number of free page frames in some memory zone falls below the pages_low watermark (see the later section Section 16.7.7).
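The two watermark tests mentioned above can be sketched as simple comparisons. The structure below is a hypothetical reduction of a memory zone descriptor to the three fields the text names, and the comparisons are deliberately simplified; the kernel's actual allocation path applies further adjustments that are not modeled here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced view of a memory zone descriptor. */
struct toy_zone {
    unsigned long free_pages;
    unsigned long pages_min;   /* pool reserved for emergencies        */
    unsigned long pages_low;   /* below this, background reclaim runs  */
};

/* 1 if the zone can give up `nr` frames and still keep its minimal pool. */
static int toy_zone_can_satisfy(const struct toy_zone *z, unsigned long nr)
{
    return z->free_pages >= nr && z->free_pages - nr >= z->pages_min;
}

/* 1 if no zone in the list can satisfy the request: reclaiming is needed. */
static int toy_must_reclaim(const struct toy_zone *zones, size_t n,
                            unsigned long nr)
{
    assert(zones != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (toy_zone_can_satisfy(&zones[i], nr))
            return 0;
    return 1;
}

/* 1 if some zone has fallen below its pages_low watermark. */
static int toy_kswapd_should_run(const struct toy_zone *zones, size_t n)
{
    assert(zones != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (zones[i].free_pages < zones[i].pages_low)
            return 1;
    return 0;
}
```

The first test guards synchronous allocations, while the second lets a background thread start reclaiming before any allocation actually fails.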

The core of the try_to_free_pages( ) function is the shrink_caches( ) function: it receives as a parameter a "goal" (namely, a given number of page frames to be reclaimed) and it terminates as soon as it has reached the goal, if possible.

To help shrink_caches( ) do its job, all pages in dynamic memory are grouped into two lists called the "active list" and the "inactive list"; they are also collectively denoted as LRU lists. The former list tends to include the pages that have been accessed recently, while the latter tends to include the pages that have not been accessed for some time. Clearly, pages should be stolen from the inactive list, although some percolation between the two lists is performed from time to time.

The shrink_caches( ) function invokes, in turn, the following functions:

kmem_cache_reap( )
    Removes empty slabs from the slab cache

refill_inactive( )
    Moves pages from the active list to the inactive list, and vice versa

shrink_cache( )
    Tries to free page frames by writing to disk inactive pages included in the page cache

shrink_dcache_memory( )
    Removes entries from the dentry cache

shrink_icache_memory( )
    Removes entries from the inode cache

Let's now discuss in greater detail the various components of the page frame reclaiming algorithm.
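The "goal-driven" behavior of shrink_caches( ) can be modeled as a loop over reclaiming stages that stops once enough page frames have been recovered. This is a deliberately uniform simplification: the function-pointer table, the toy stages, and their fixed yields are all hypothetical, and the real function does not treat each of the routines listed above as a counter-returning stage.

```c
#include <assert.h>
#include <stddef.h>

/* A stage is asked for `wanted` frames and reports how many it freed. */
typedef int (*toy_reclaimer)(int wanted);

static int toy_min(int a, int b) { return a < b ? a : b; }

/* Toy stages with fixed yields, standing in for real cache shrinkers. */
static int toy_reap_slabs(int wanted)        { return toy_min(wanted, 3); }
static int toy_shrink_page_cache(int wanted) { return toy_min(wanted, 10); }

/* Runs the stages in order; returns the number of frames still missing. */
static int toy_shrink_caches(const toy_reclaimer *stages, size_t n, int goal)
{
    int remaining = goal;

    assert(goal >= 0 && (stages != NULL || n == 0));
    for (size_t i = 0; i < n && remaining > 0; i++)
        remaining -= stages[i](remaining);
    return remaining > 0 ? remaining : 0;
}
```

The order of the table matters: cheaper sources of page frames are placed first, so that more valuable caches are touched only when the earlier stages fall short of the goal.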

16.7.2 The Least Recently Used (LRU) Lists

The active list and the inactive list of pages are the core data structures of the page frame reclaiming algorithm. The heads of these two doubly linked lists are stored, respectively, in the active_list and inactive_list variables. The nr_active_pages and nr_inactive_pages variables store the number of pages in the two lists. The pagemap_lru_lock spin lock protects the two lists against concurrent accesses in SMP systems.

If a page belongs to an LRU list, its PG_lru flag in the page descriptor is set. Moreover, if the page belongs to the active list, the PG_active flag is set, while if it belongs to the inactive list, the PG_active flag is cleared. The lru field of the page descriptor stores the pointers to the next and previous elements in the LRU list.

Several auxiliary functions and macros are available to handle the LRU lists:

add_page_to_active_list
    Sets the PG_active flag, adds the page to the head of the active list, and increases nr_active_pages.

add_page_to_inactive_list
    Adds the page to the head of the inactive list and increases nr_inactive_pages.

del_page_from_active_list
    Removes the page from the active list, clears the PG_active flag, and decreases nr_active_pages.

del_page_from_inactive_list
    Removes the page from the inactive list and decreases nr_inactive_pages.

activate_page_nolock( ) and activate_page( )
    If the page is in the inactive list, moves it into the active list by executing del_page_from_inactive_list and then add_page_to_active_list. The activate_page( ) function also acquires the pagemap_lru_lock spin lock before moving the page.

lru_cache_add( )
    If the page is not included in an LRU list, sets the PG_lru flag, acquires the pagemap_lru_lock spin lock, and executes add_page_to_inactive_list to insert the page in the inactive list.

__lru_cache_del( ) and lru_cache_del( )
    If the page is included in an LRU list, clears the PG_lru flag and executes either del_page_from_active_list or del_page_from_inactive_list, according to the value of the PG_active flag. The lru_cache_del( ) function also acquires the pagemap_lru_lock spin lock before removing the page.

16.7.2.1 Moving pages across the LRU lists

The kernel collects the pages that were recently accessed in the active list so that it will not scan them when looking for a page frame to reclaim. Conversely, the kernel collects the pages that have not been accessed for a long time in the inactive list. Of course, pages should move from the inactive list to the active list and back, according to whether they are being accessed.

Clearly, two page states ("active" and "inactive") are not sufficient to describe all possible access patterns. For instance, suppose a logger process writes some data in a page once every hour. Although the page is "inactive" for most of the time, the access makes it "active," thus denying the reclaiming of the corresponding page frame, even if it is not going to be accessed for an entire hour. Of course, there is no general solution to this problem because the kernel has no way to predict the behavior of User Mode processes; however, it seems reasonable that pages should not change their status on every single access.

The PG_referenced flag in the page descriptor is used to double the number of accesses required to move a page from the inactive list to the active list; it is also used to double the number of "missing accesses" required to move a page from the active list to the inactive list (see below). For instance, suppose that a page in the inactive list has the PG_referenced flag set to 0. The first page access sets the value of the flag to 1, but the page remains in the inactive list. The second page access finds the flag set and causes the page to be moved into the active list. If, however, the second access does not occur within a given time interval after the first one, the page frame reclaiming algorithm may reset the PG_referenced flag.

As shown in Figure 16-4, the kernel uses the mark_page_accessed( ) and refill_inactive( ) functions to move the pages across the LRU lists. In the figure, the LRU list including the page is specified by the status of the PG_active flag.

Figure 16-4. Moving pages across the LRU lists

Whenever the kernel must mark a page as accessed, it invokes the mark_page_accessed( ) function. This happens every time the kernel determines that a page is being referenced by a User Mode process, a filesystem layer, or a device driver. For instance, mark_page_accessed( ) is invoked in the following cases:

● When loading an anonymous page of a process on demand (performed by the do_anonymous_page( ) function in Section 8.4.3).

● When reading a block from disk (performed by the bread( ) function in Section 13.4.8).

● When loading on demand a page of a memory mapped file (performed by the filemap_nopage( ) function in Section 15.2.4).

● When reading a page of data from a file (performed by the do_generic_file_read( ) function in Section 15.1.1).

● When swapping in a page (see the earlier Section 16.6.1).

● When the kernel finds the Accessed flag set in the Page Table entry while searching for a page to be swapped out (see the earlier Section 16.5.1).

● When the kernel reads a page of data from a disk device (performed by the ext2_get_page( ) function in Chapter 17).

The mark_page_accessed( ) function executes the following code fragment:

    if (PageActive(page) || !PageReferenced(page))
        SetPageReferenced(page);
    else {
        activate_page(page);
        ClearPageReferenced(page);
    }

As shown in Figure 16-4, the function moves the page from the inactive list to the active list only if the PG_referenced flag is set before the invocation.

The kernel periodically checks the status of the pages in the active list by executing the refill_inactive( ) function. Starting from the bottom of the active list (the older pages in the list), the function checks whether the PG_referenced flag of each page is set. If it is, the function clears the flag and moves the page into the first position of the active list; if it isn't, the function moves the page into the first position of the inactive list. The logic in the function is as follows:

    if (PageReferenced(page)) {
        ClearPageReferenced(page);
        list_del(&page->lru);
        list_add(&page->lru, &active_list);
    } else {
        del_page_from_active_list(page);
        add_page_to_inactive_list(page);
        SetPageReferenced(page);
    }

The refill_inactive( ) function does not scan the pages in the inactive list; hence, the PG_referenced flag of a page is never cleared as long as the page remains in the inactive list.
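The interplay between the two fragments above can be followed more easily on a toy model. The following user-space sketch is not kernel code: the sim_page structure and the sim_ functions are hypothetical names introduced only for illustration, and the lists themselves are reduced to a single boolean that stands for PG_active.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for the two page flags discussed above. */
struct sim_page {
    bool active;      /* models PG_active: true if the page is in the active list */
    bool referenced;  /* models PG_referenced */
};

/* Mirrors the mark_page_accessed( ) fragment shown earlier. */
void sim_mark_page_accessed(struct sim_page *p)
{
    assert(p != NULL);
    if (p->active || !p->referenced) {
        p->referenced = true;       /* SetPageReferenced( ) */
    } else {
        p->active = true;           /* activate_page( ) */
        p->referenced = false;      /* ClearPageReferenced( ) */
    }
}

/* Mirrors the per-page logic of refill_inactive( ); valid only for active pages. */
void sim_refill_inactive(struct sim_page *p)
{
    assert(p != NULL && p->active);
    if (p->referenced) {
        p->referenced = false;      /* page stays active, moved to the list head */
    } else {
        p->active = false;          /* page moved to the head of the inactive list */
        p->referenced = true;       /* SetPageReferenced( ), as in the fragment */
    }
}
```

Starting from an inactive, unreferenced page, two calls to sim_mark_page_accessed( ) are needed to make it active. Conversely, in this model an active page must be visited twice by sim_refill_inactive( ) without any intervening access before it becomes inactive.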

16.7.3 The try_to_free_pages( ) Function

The try_to_free_pages( ) function is the main function that triggers the reclaiming of page frames. It receives as parameters:

classzone
    The memory zone containing the page frames to be reclaimed

gfp_mask
    A set of flags whose meaning is exactly the same as the corresponding parameter of the alloc_pages( ) function (see Section 7.1.5)

order
    Not used

The goal of the function is to free SWAP_CLUSTER_MAX page frames (usually, 32) by repeatedly invoking the shrink_caches( ) function, each time with a higher priority than the previous invocation. The try_to_free_pages( ) function is thus essentially equivalent to the following code fragment:

    int priority = DEF_PRIORITY;
    int nr_pages = SWAP_CLUSTER_MAX;
    if (current->flags & PF_NOIO)
        gfp_mask &= ~(__GFP_IO | __GFP_HIGHIO | __GFP_FS);
    do {
        nr_pages = shrink_caches(classzone, priority, gfp_mask, nr_pages);
        if (nr_pages

[...]

/tmp/dump_hex to get a file containing the hexadecimal dump of the floppy disk contents in the /tmp directory.[1]

[1] Some information on an Ext2 filesystem could also be obtained by using the dumpe2fs and debugfs utility programs.

By looking at that file, we can see that, due to the limited capacity of the disk, a single group descriptor is sufficient. We also notice that the number of reserved blocks is set to 72 (5 percent of 1,440) and, according to the default option, the inode table must include 1 inode for each 4,096 bytes, that is, 360 inodes stored in 45 blocks. Table 17-7 summarizes how the Ext2 filesystem is created on a floppy disk when the default options are selected.

Table 17-7. Ext2 block allocation for a floppy disk

Block      Content
0          Boot block
1          Superblock
2          Block containing a single block group descriptor
3          Data block bitmap
4          Inode bitmap
5-49       Inode table: inodes up to 10: reserved; inode 11: lost+found; inodes 12-360: free
50         Root directory (includes ., .., and lost+found)
51         lost+found directory (includes . and ..)
52-62      Reserved blocks preallocated for lost+found directory
63-1439    Free blocks
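The figures quoted before Table 17-7 can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function and constant names are hypothetical, and the 128-byte size of an Ext2 disk inode is an assumption that this section does not restate.

```c
#include <assert.h>

/* Illustrative constants for a 1.44 MB floppy formatted with 1 KB blocks. */
enum {
    FLOPPY_BLOCKS       = 1440,  /* 1,440 blocks of 1,024 bytes each */
    FLOPPY_BLOCK_SIZE   = 1024,
    BYTES_PER_INODE     = 4096,  /* default ratio quoted in the text */
    ASSUMED_INODE_SIZE  = 128    /* assumed size of an Ext2 disk inode */
};

/* Blocks reserved for the superuser: 5 percent of the total. */
unsigned reserved_blocks(unsigned total_blocks)
{
    return total_blocks * 5 / 100;
}

/* One inode for every bytes_per_inode bytes of disk capacity. */
unsigned inode_count(unsigned total_blocks, unsigned block_size,
                     unsigned bytes_per_inode)
{
    assert(bytes_per_inode > 0);
    return total_blocks * block_size / bytes_per_inode;
}

/* Blocks needed by the inode table, rounded up to a whole block. */
unsigned inode_table_blocks(unsigned inodes, unsigned inode_size,
                            unsigned block_size)
{
    assert(block_size > 0);
    return (inodes * inode_size + block_size - 1) / block_size;
}
```

With these inputs the inode table occupies 45 blocks, which matches the range 5-49 in the table.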


17.5 Ext2 Methods

Many of the VFS methods described in Chapter 12 have a corresponding Ext2 implementation. Since it would take a whole book to describe all of them, we limit ourselves to briefly reviewing the methods implemented in Ext2. Once the disk and the memory data structures are clearly understood, the reader should be able to follow the code of the Ext2 functions that implement them.

17.5.1 Ext2 Superblock Operations

Many VFS superblock operations have a specific implementation in Ext2, namely read_inode, write_inode, put_inode, delete_inode, put_super, write_super, statfs, and remount_fs. The addresses of the superblock methods are stored in the ext2_sops array of pointers.

17.5.2 Ext2 Inode Operations

Some of the VFS inode operations have a specific implementation in Ext2, which depends on the type of the file to which the inode refers.

If the inode refers to a regular file, all inode operations listed in the ext2_file_inode_operations table have a NULL pointer, except for the truncate operation, which is implemented by the ext2_truncate( ) function. Recall that the VFS uses its own generic functions when the corresponding Ext2 method is undefined (a NULL pointer).

If the inode refers to a directory, most inode operations listed in the ext2_dir_inode_operations table are implemented by specific Ext2 functions (see Table 17-8).

Table 17-8. Ext2 inode operations for directory files

VFS inode operation    Ext2 directory inode method
create                 ext2_create( )
lookup                 ext2_lookup( )
link                   ext2_link( )
unlink                 ext2_unlink( )
symlink                ext2_symlink( )
mkdir                  ext2_mkdir( )
rmdir                  ext2_rmdir( )
mknod                  ext2_mknod( )
rename                 ext2_rename( )

If the inode refers to a symbolic link that can be fully stored inside the inode itself, all inode methods are NULL except for readlink and follow_link, which are implemented by ext2_readlink( ) and ext2_follow_link( ), respectively. The addresses of those methods are stored in the ext2_fast_symlink_inode_operations table. On the other hand, if the inode refers to a long symbolic link that has to be stored inside a data block, the readlink and follow_link methods are implemented by the generic page_readlink( ) and page_follow_link( ) functions, whose addresses are stored in the page_symlink_inode_operations table.

If the inode refers to a character device file, to a block device file, or to a named pipe (see Section 19.2), the inode operations do not depend on the filesystem. They are specified in the chrdev_inode_operations, blkdev_inode_operations, and fifo_inode_operations tables, respectively.

17.5.3 Ext2 File Operations

The file operations specific to the Ext2 filesystem are listed in Table 17-9. As you can see, several VFS methods are implemented by generic functions that are common to many filesystems. The addresses of these methods are stored in the ext2_file_operations table.

Table 17-9. Ext2 file operations

VFS file operation    Ext2 method
llseek                generic_file_llseek( )
read                  generic_file_read( )
write                 generic_file_write( )
ioctl                 ext2_ioctl( )
mmap                  generic_file_mmap( )
open                  generic_file_open( )
release               ext2_release_file( )
fsync                 ext2_sync_file( )

Notice that Ext2's read and write methods are implemented by the generic_file_read( ) and generic_file_write( ) functions, respectively. These are described in Section 15.1.1 and Section 15.1.3.


17.6 Managing Ext2 Disk Space

The storage of a file on disk differs from the view the programmer has of the file in two ways: blocks can be scattered around the disk (although the filesystem tries hard to keep blocks sequential to improve access time), and files may appear to a programmer to be bigger than they really are because a program can introduce holes into them (through the lseek( ) system call).

In this section, we explain how the Ext2 filesystem manages the disk space, that is, how it allocates and deallocates inodes and data blocks. Two main problems must be addressed:

● Space management must make every effort to avoid file fragmentation, that is, the physical storage of a file in several small pieces located in nonadjacent disk blocks. File fragmentation increases the average time of sequential read operations on the files, since the disk heads must be frequently repositioned during the read operation.[2] This problem is similar to the external fragmentation of RAM discussed in Section 7.1.7.

[2] Please note that fragmenting a file across block groups (A Bad Thing) is quite different from the not-yet-implemented fragmentation of blocks to store many files in one block (A Good Thing).

● Space management must be time-efficient; that is, the kernel should be able to quickly derive from a file offset the corresponding logical block number in the Ext2 partition. In doing so, the kernel should limit as much as possible the number of accesses to addressing tables stored on disk, since each such intermediate access considerably increases the average file access time.

17.6.1 Creating Inodes

The ext2_new_inode( ) function creates an Ext2 disk inode, returning the address of the corresponding inode object (or NULL, in case of failure). It acts on two parameters: the address dir of the inode object that refers to the directory into which the new inode must be inserted and a mode that indicates the type of inode being created. The latter argument also includes an MS_SYNCHRONOUS flag that requires the current process to be suspended until the inode is allocated. The function performs the following actions:

1. Invokes new_inode( ) to allocate a new inode object and initializes its i_sb field to the superblock address stored in dir->i_sb.

2. Invokes down( ) on the s_lock semaphore included in the parent superblock. As we know, the kernel suspends the current process if the semaphore is already busy.

3. If the new inode is a directory, tries to place it so that directories are evenly scattered through partially filled block groups. In particular, allocates the new directory in the block group that has the maximum number of free blocks among all block groups that have a greater than average number of free inodes. (The average is the total number of free inodes divided by the number of block groups.)

4. If the new inode is not a directory, allocates it in a block group having a free inode. The function selects the group by starting from the one that contains the parent directory and moving farther away from it; to be precise:

    a. Performs a quick logarithmic search starting from the block group that includes the parent directory dir. The algorithm searches log(n) block groups, where n is the total number of block groups. The algorithm jumps further ahead until it finds an available block group. For example, if we call the number of the starting block group i, the algorithm considers block groups i mod (n), i+1 mod (n), i+1+2 mod (n), i+1+2+4 mod (n), etc.

    b. If the logarithmic search failed in finding a block group with a free inode, the function performs an exhaustive linear search starting from the block group that includes the parent directory dir.

5. Invokes load_inode_bitmap( ) to get the inode bitmap of the selected block group and searches for the first null bit in it, thus obtaining the number of the first free disk inode.

6. Allocates the disk inode: sets the corresponding bit in the inode bitmap and marks the buffer containing the bitmap as dirty. Moreover, if the filesystem has been mounted specifying the MS_SYNCHRONOUS flag, invokes ll_rw_block( ) and waits until the write operation terminates (see Section 12.4.2).

7. Decrements the bg_free_inodes_count field of the group descriptor. If the new inode is a directory, increments the bg_used_dirs_count field. Marks the buffer containing the group descriptor as dirty.

8. Decrements the s_free_inodes_count field of the disk superblock and marks the buffer containing it as dirty. Sets the s_dirt field of the VFS's superblock object to 1.

9. Initializes the fields of the inode object. In particular, sets the inode number i_ino and copies the value of xtime.tv_sec into i_atime, i_mtime, and i_ctime. Also loads the i_block_group field in the ext2_inode_info structure with the block group index. Refer to Table 17-3 for the meaning of these fields.

10. Inserts the new inode object into the hash table inode_hashtable and invokes mark_inode_dirty( ) to move the inode object into the superblock's dirty inode list (see Section 12.2.2).

11. Invokes up( ) on the s_lock semaphore included in the parent superblock.

12. Returns the address of the new inode object.
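The group selection of step 4 can be sketched in user space as follows. This is a minimal illustration under stated assumptions, not the kernel's implementation: log_search_probes( ) and pick_group( ) are hypothetical names, and the per-group free inode counts are passed in as a plain array instead of being read from group descriptors.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Fills probes[] with the block groups visited by the logarithmic search
 * that starts from group `start` out of `n` groups: start, start+1,
 * start+1+2, start+1+2+4, ... (all modulo n). Returns the number of
 * entries written, never more than `max`.
 */
size_t log_search_probes(unsigned start, unsigned n, unsigned *probes, size_t max)
{
    size_t count = 0;
    unsigned i, j;

    assert(n > 0 && probes != NULL);
    if (max == 0)
        return 0;
    i = start % n;
    probes[count++] = i;                 /* the parent directory's group */
    for (j = 1; j < n && count < max; j <<= 1) {
        i = (i + j) % n;                 /* jump 1, 2, 4, ... groups ahead */
        probes[count++] = i;
    }
    return count;
}

/*
 * Returns the index of a group with at least one free inode, trying the
 * logarithmic search first and an exhaustive linear scan afterward;
 * returns -1 if every group is full.
 */
int pick_group(unsigned start, const unsigned *free_inodes, unsigned n)
{
    unsigned probes[32];
    size_t k, count;
    unsigned g;

    assert(free_inodes != NULL);
    count = log_search_probes(start, n, probes, 32);
    for (k = 0; k < count; k++)
        if (free_inodes[probes[k]] > 0)
            return (int)probes[k];
    for (k = 0; k < n; k++) {            /* exhaustive fallback */
        g = (start + (unsigned)k) % n;
        if (free_inodes[g] > 0)
            return (int)g;
    }
    return -1;
}
```

With 16 groups and a parent directory in group 5, the logarithmic phase probes groups 5, 6, 8, 12, and 4. Only a handful of groups are examined before the function falls back to the linear scan.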

17.6.2 Deleting Inodes

The ext2_free_inode( ) function deletes a disk inode, which is identified by an inode object whose address is passed as the parameter. The kernel should invoke the function after a series of cleanup operations involving internal data structures and the data in the file itself. It should come after the inode object has been removed from the inode hash table, after the last hard link referring to that inode has been deleted from the proper directory, and after the file is truncated to 0 length to reclaim all its data blocks (see Section 17.6.6 later in this chapter). It performs the following actions:

1. Invokes down( ) on the s_lock semaphore included in the parent superblock to get exclusive access to the superblock object.

2. Invokes clear_inode( ) to perform the following operations:

    a. Invokes invalidate_inode_buffers( ) to remove the dirty buffers that belong to the inode from its i_dirty_buffers and i_dirty_data_buffers lists (see Section 14.2.1).

    b. If the I_LOCK flag of the inode is set, some of the inode's buffers are involved in I/O data transfers; the function suspends the current process until these I/O data transfers terminate.

    c. Invokes the clear_inode method of the superblock object, if defined; the Ext2 filesystem does not define it.

    d. Sets the state of the inode to I_CLEAR (the inode object contents are no longer meaningful).

3. Computes the index of the block group containing the disk inode from the inode number and the number of inodes in each block group.

4. Invokes load_inode_bitmap( ) to get the inode bitmap.

5. Increments the bg_free_inodes_count field of the group descriptor. If the deleted inode is a directory, decrements the bg_used_dirs_count field. Marks the buffer that contains the group descriptor as dirty.

6. Increments the s_free_inodes_count field of the disk superblock and marks the buffer that contains it as dirty. Also sets the s_dirt field of the superblock object to 1.

7. Clears the bit corresponding to the disk inode in the inode bitmap and marks the buffer that contains the bitmap as dirty. Moreover, if the filesystem has been mounted with the MS_SYNCHRONOUS flag, invokes ll_rw_block( ) and waits until the write operation on the bitmap's buffer terminates.

8. Invokes up( ) on the s_lock semaphore included in the parent superblock object.
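The computation of step 3 is a plain integer division. The sketch below assumes that Ext2 inode numbers start from 1, so that inode 1 corresponds to bit 0 of the first group's bitmap; the inode_location structure and the locate_inode( ) function are hypothetical names used only for illustration.

```c
#include <assert.h>

/* Where a disk inode lives, expressed in terms of the on-disk bitmaps. */
struct inode_location {
    unsigned long group;  /* index of the block group holding the disk inode */
    unsigned long bit;    /* position of its bit in that group's inode bitmap */
};

/* Derives the block group and bitmap position from an inode number. */
struct inode_location locate_inode(unsigned long ino,
                                   unsigned long inodes_per_group)
{
    struct inode_location loc;

    assert(ino >= 1 && inodes_per_group > 0);  /* inode numbers start at 1 */
    loc.group = (ino - 1) / inodes_per_group;
    loc.bit   = (ino - 1) % inodes_per_group;
    return loc;
}
```

On the floppy disk of Table 17-7, which has a single group of 360 inodes, the lost+found inode (number 11) maps to group 0, bit 10.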

17.6.3 Data Blocks Addressing

Each nonempty regular file consists of a group of data blocks. Such blocks may be referred to either by their relative position inside the file (their file block number) or by their position inside the disk partition (their logical block number, explained in Section 13.4.4).

Deriving the logical block number of the corresponding data block from an offset f inside a file is a two-step process:

1. Derive from the offset f the file block number, that is, the index of the block that contains the character at offset f.

2. Translate the file block number to the corresponding logical block number.

Since Unix files do not include any control characters, it is quite easy to derive the file block number containing the f th character of a file: simply take the quotient of f and the filesystem's block size and round down to the nearest integer.

For instance, let's assume a block size of 4 KB. If f is smaller than 4,096, the character is contained in the first data block of the file, which has file block number 0. If f is equal to or greater than 4,096 and less than 8,192, the character is contained in the data block that has file block number 1, and so on.

This is fine as far as file block numbers are concerned. However, translating a file block number into the corresponding logical block number is not nearly as straightforward, since the data blocks of an Ext2 file are not necessarily adjacent on disk.

The Ext2 filesystem must therefore provide a method to store the connection between each file block number and the corresponding logical block number on disk. This mapping, which goes back to early versions of Unix from AT&T, is implemented partly inside the inode. It also involves some specialized blocks that contain extra pointers, which are an inode extension used to handle large files.

The i_block field in the disk inode is an array of EXT2_N_BLOCKS components that contain logical block numbers. In the following discussion, we assume that EXT2_N_BLOCKS has the default value, namely 15. The array represents the initial part of a larger data structure, which is illustrated in Figure 17-5. As can be seen in the figure, the 15 components of the array are of 4 different types:

● The first 12 components yield the logical block numbers corresponding to the first 12 blocks of the file, that is, to the blocks that have file block numbers from 0 to 11.

● The component at index 12 contains the logical block number of a block that represents a second-order array of logical block numbers. They correspond to the file block numbers ranging from 12 to $b/4 + 11$, where b is the filesystem's block size (each logical block number is stored in 4 bytes, so we divide by 4 in the formula). Therefore, the kernel must look in this component for a pointer to a block, and then look in that block for another pointer to the ultimate block that contains the file contents.

● The component at index 13 contains the logical block number of a block containing a second-order array of logical block numbers; in turn, the entries of this second-order array point to third-order arrays, which store the logical block numbers that correspond to the file block numbers ranging from $b/4 + 12$ to $(b/4)^2 + (b/4) + 11$.

● Finally, the component at index 14 uses triple indirection: the fourth-order arrays store the logical block numbers corresponding to the file block numbers ranging from $(b/4)^2 + (b/4) + 12$ to $(b/4)^3 + (b/4)^2 + (b/4) + 11$.
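The two-step derivation and the four ranges listed above can be condensed into a short sketch. It only classifies a file block number by the kind of i_block component that reaches it; it does not read any disk structure, and the function names are hypothetical.

```c
#include <assert.h>

/* Step 1: the file block number is the offset divided by the block size. */
unsigned long long file_block_number(unsigned long long offset,
                                     unsigned block_size)
{
    assert(block_size > 0);
    return offset / block_size;          /* integer division rounds down */
}

/*
 * Returns the number of indirect blocks that must be traversed to reach
 * file block `fbn`: 0 (direct), 1, 2, or 3, or -1 if the block lies
 * beyond the range covered by triple indirection.
 */
int indirection_level(unsigned long long fbn, unsigned block_size)
{
    unsigned long long p;                /* logical block numbers per block */

    assert(block_size >= 4);
    p = block_size / 4;                  /* each number takes 4 bytes */
    if (fbn < 12)
        return 0;                        /* i_block[0..11] */
    fbn -= 12;
    if (fbn < p)
        return 1;                        /* through i_block[12] */
    fbn -= p;
    if (fbn < p * p)
        return 2;                        /* through i_block[13] */
    fbn -= p * p;
    if (fbn < p * p * p)
        return 3;                        /* through i_block[14] */
    return -1;
}
```

For a block size of 1,024 bytes, b/4 is 256, so file block 267 is the last one reached through simple indirection and file block 268 is the first one that needs double indirection.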

Figure 17-5. Data structures used to address the file's data blocks

In Figure 17-5, the number inside a block represents the corresponding file block number. The arrows, which represent logical block numbers stored in array components, show how the kernel finds its way to reach the block that contains the actual contents of the file.

Notice how this mechanism favors small files. If the file does not require more than 12 data blocks, any data can be retrieved in two disk accesses: one to read a component in the i_block array of the disk inode and the other to read the requested data block. For larger files, however, three or even four consecutive disk accesses may be needed to access the required block. In practice, this is a worst-case estimate, since dentry, buffer, and page caches contribute significantly to reduce the number of real disk accesses.

Notice also how the block size of the filesystem affects the addressing mechanism, since a larger block size allows Ext2 to store more logical block numbers inside a single block. Table 17-10 shows the upper limit placed on a file's size for each block size and each addressing mode. For instance, if the block size is 1,024 bytes and the file contains up to 268 kilobytes of data, the first 12 KB of a file can be accessed through direct mapping and the remaining 13-268 KB can be addressed through simple indirection. Files larger than 2 GB must be opened on 32-bit architectures by specifying the O_LARGEFILE opening flag. In any case, the Ext2 filesystem puts an upper limit on the file size equal to 2 TB minus 4,096 bytes.

Table 17-10. File size upper limits for data block addressing

Block Size (bytes)    Direct    1-Indirect    2-Indirect    3-Indirect
1,024                 12 KB     268 KB        64.26 MB      16.06 GB
2,048                 24 KB     1.02 MB       513.02 MB     256.5 GB
4,096                 48 KB     4.04 MB       4 GB          ~ 2 TB
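The entries of Table 17-10 follow from the addressing ranges: each addressing mode adds $(b/4)^k$ blocks to the 12 direct ones. The Python sketch below recomputes the limits under that assumption. The function name `size_limits` is made up for this example, and the cap of 2 TB minus 4,096 bytes is the overall Ext2 limit quoted in the text, which is presumably why the last entry for 4,096-byte blocks reads "~ 2 TB" rather than the larger value that triple indirection alone would allow.

```python
EXT2_MAX_FILE_SIZE = 2 * 2**40 - 4096   # overall cap quoted in the text


def size_limits(block_size, pointer_size=4, cap=EXT2_MAX_FILE_SIZE):
    """Return the upper file size, in bytes, reachable with direct mapping
    and with simple, double, and triple indirection (in that order)."""
    per_block = block_size // pointer_size   # b/4 entries per indirection block
    blocks = 12                              # blocks reachable by direct mapping
    limits = [min(blocks * block_size, cap)]
    for level in (1, 2, 3):
        blocks += per_block ** level         # blocks added by this level
        limits.append(min(blocks * block_size, cap))
    return limits


for b in (1024, 2048, 4096):
    print(b, size_limits(b))
```

The printed values are in bytes; dividing by 2**10, 2**20, or 2**30 gives the KB, MB, and GB figures of the table, up to the rounding used there.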

17.6.4 File Holes

A file hole is a portion of a regular file that contains null characters and is not stored in any data block on disk. Holes are a long-standing feature of Unix files. For instance, the following Unix command creates a file in which the first bytes are a hole:

$ echo -n "X" | dd of=/tmp/hole bs=1024 seek=6

Now /tmp/hole has 6,145 characters (6,144 null characters plus an X character), yet the file occupies just one data block on disk.

File holes were introduced to avoid wasting disk space. They are used extensively by database applications and, more generally, by all applications that perform hashing on files.

The Ext2 implementation of file holes is based on dynamic data block allocation: a block is actually assigned to a file only when the process needs to write data into it. The i_size field of each inode defines the size of the file as seen by the program, including the hole, while the i_blocks field stores the number of data blocks effectively assigned to the file (in units of 512 bytes).

In our earlier example of the dd command, suppose the /tmp/hole file was created on an Ext2 partition that has blocks of size 4,096. The i_size field of the corresponding disk inode stores the number 6,145, while the i_blocks field stores the number 8 (because each 4,096-byte block includes eight 512-byte blocks). The second element of the i_block array (corresponding to the block having file block number 1) stores the logical block number of the allocated block, while all other elements in the array are null (see Figure 17-6).

Figure 17-6. A file with an initial hole
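The difference between the apparent size and the allocated space can also be observed from user space. The following Python sketch is an illustration under stated assumptions, not part of the kernel: it assumes a Unix-like system whose stat( ) result exposes st_blocks in 512-byte units, and a filesystem that supports holes. The helper name `make_file_with_hole` is hypothetical. It reproduces the dd example by seeking past the first 6,144 bytes before writing a single character.

```python
import os
import tempfile


def make_file_with_hole(path, hole_bytes=6 * 1024):
    """Create a file whose first hole_bytes bytes are never written."""
    with open(path, "wb") as f:
        f.seek(hole_bytes)   # move the file pointer without writing anything
        f.write(b"X")        # only this byte needs a data block


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "hole")
    make_file_with_hole(path)
    st = os.stat(path)
    apparent = st.st_size                          # corresponds to i_size
    allocated = getattr(st, "st_blocks", 0) * 512  # corresponds to i_blocks
    with open(path, "rb") as f:
        data = f.read()                            # the hole reads back as nulls
    print("apparent size:", apparent, "allocated bytes:", allocated)
```

The allocated figure depends on the filesystem and its block size, so the sketch prints it rather than assuming a value.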

17.6.5 Allocating a Data Block

When the kernel has to locate a block holding data for an Ext2 regular file, it invokes the ext2_get_block( ) function. If the block does not exist, the function automatically allocates the block to the file. Remember that this function is invoked every time the kernel issues a read or write operation on an Ext2 regular file (see Section 15.1.1 and Section 15.1.3).

The ext2_get_block( ) function handles the data structures already described in Section 17.6.3, and when necessary, invokes the ext2_alloc_block( ) function to actually search for a free block in the Ext2 partition.

To reduce file fragmentation, the Ext2 filesystem tries to get a new block for a file near the last block already allocated for the file. Failing that, the filesystem searches for a new block in the block group that includes the file's inode. As a last resort, the free block is taken from one of the other block groups.

The Ext2 filesystem uses preallocation of data blocks. The file does not get just the requested block, but rather a group of up to eight adjacent blocks. The i_prealloc_count field in the ext2_inode_info structure stores the number of data blocks preallocated to a file that are still unused, and the i_prealloc_block field stores the logical block number of the next preallocated block to be used. Any preallocated blocks that remain unused are freed when the file is closed, when it is truncated, or when a write operation is not sequential with respect to the write operation that triggered the block preallocation.

The ext2_alloc_block( ) function receives as parameters a pointer to an inode object and a goal. The goal is a logical block number that represents the preferred position of the new block. The ext2_get_block( ) function sets the goal parameter according to the following heuristic:

1. If the block that is being allocated and the previously allocated block have consecutive file block numbers, the goal is the logical block number of the previous block plus 1; it makes sense that consecutive blocks as seen by a program should be adjacent on disk.
2. If the first rule does not apply and at least one block has been previously allocated to the file, the goal is one of these blocks' logical block numbers. More precisely, it is the logical block number of the already allocated block that precedes the block to be allocated in the file.
3. If the preceding rules do not apply, the goal is the logical block number of the first block (not necessarily free) in the block group that contains the file's inode.

The ext2_alloc_block( ) function checks whether the goal refers to one of the preallocated blocks of the file. If so, it allocates the corresponding block and returns its logical block number; otherwise, the function discards all remaining preallocated blocks and invokes ext2_new_block( ).

This latter function searches for a free block inside the Ext2 partition with the following strategy:

1. If the preferred block passed to ext2_alloc_block( ), the goal, is free, the function allocates the block.
2. If the goal is busy, the function checks whether one of the next 64 blocks after the preferred block is free.
3. If no free block is found in the near vicinity of the preferred block, the function considers all block groups, starting from the one including the goal. For each block group, the function does the following:
   a. Looks for a group of at least eight adjacent free blocks.
   b. If no such group is found, looks for a single free block.

The search ends as soon as a free block is found. Before terminating, the ext2_new_block( ) function also tries to preallocate up to eight free blocks adjacent to the free block found and sets the i_prealloc_block and i_prealloc_count fields of the disk inode to the proper block location and number of blocks.
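The goal heuristic and the search strategy described in this section can be modeled in a few lines. The Python sketch below is a simplified illustration, not the kernel's implementation: the names `choose_goal` and `new_block`, the dictionary that maps file block numbers to logical block numbers, and the list of booleans standing in for the block bitmaps are all assumptions made for this example. Preallocation is left out, and rule 1 is approximated by checking whether the preceding file block is already mapped.

```python
def choose_goal(file_block, mapped, group_first_block):
    """Pick the preferred logical block for file_block.

    mapped: dict {file block number: logical block number} (hypothetical).
    """
    if file_block - 1 in mapped:                   # rule 1: consecutive blocks
        return mapped[file_block - 1] + 1
    earlier = [fb for fb in mapped if fb < file_block]
    if earlier:                                    # rule 2: closest preceding block
        return mapped[max(earlier)]
    return group_first_block                       # rule 3: start of inode's group


def new_block(free, goal, blocks_per_group, window=64, run=8):
    """Search for a free block; free[i] is True when logical block i is free.

    Assumes len(free) is a multiple of blocks_per_group.
    """
    total = len(free)
    if free[goal]:                                 # step 1: the goal itself
        return goal
    for b in range(goal + 1, min(goal + 1 + window, total)):
        if free[b]:                                # step 2: the next 64 blocks
            return b
    groups = total // blocks_per_group
    start = goal // blocks_per_group
    for i in range(groups):                        # step 3: every block group
        g = (start + i) % groups
        lo, hi = g * blocks_per_group, (g + 1) * blocks_per_group
        for b in range(lo, hi - run + 1):          # 3a: eight adjacent free blocks
            if all(free[b:b + run]):
                return b
        for b in range(lo, hi):                    # 3b: any single free block
            if free[b]:
                return b
    return None                                    # no free block anywhere
```

A caller would first compute the goal with `choose_goal` and then pass it to `new_block`; the `window` and `run` defaults reflect the values 64 and 8 mentioned in the text.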

17.6.6 Releasing a Data Block

When a process deletes a file or truncates it to 0 length, all its data blocks must be reclaimed. This is done by ext2_truncate( ), which receives the address of the file's inode object as its parameter. The function essentially scans the disk inode's i_block array to locate all data blocks and all blocks used for the indirect addressing. These blocks are then released by repeatedly invoking ext2_free_blocks( ).

The ext2_free_blocks( ) function releases a group of one or more adjacent data blocks. Besides its use by ext2_truncate( ), the function is invoked mainly when discarding the preallocated blocks of a file (see the earlier Section 17.6.5). Its parameters are:

inode

The address of the inode object that describes the file

block

The logical block number of the first block to be released

count

The number of adjacent blocks to be released

The function invokes down( ) on the superblock's s_lock semaphore to get exclusive access to the filesystem's superblock, and then performs the following actions for each block to be released:

1. Gets the block bitmap of the block group that includes the block to be released
2. Clears the bit in the block bitmap that corresponds to the block to be released and marks the buffer that contains the bitmap as dirty
3. Increments the bg_free_blocks_count field in the block group descriptor and marks the corresponding buffer as dirty
4. Increments the s_free_blocks_count field of the disk superblock, marks the corresponding buffer as dirty, and sets the s_dirt flag of the superblock object
5. If the filesystem has been mounted with the MS_SYNCHRONOUS flag set, invokes ll_rw_block( ) and waits until the write operation on the bitmap's buffer terminates

Finally, the function invokes up( ) to release the superblock's s_lock semaphore.
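The bookkeeping performed for each released block can be summarized with a toy model. The Python sketch below is an assumption-laden illustration rather than the kernel routine: the class `Ext2Model`, its attributes, and the use of a `threading.Lock` in place of the s_lock semaphore are invented for this example; the double-free check is an added safeguard not described in the text, and the synchronous-mount case (step 5) is not modeled.

```python
import threading


class Ext2Model:
    """Toy in-memory stand-in for the bitmaps and free-block counters."""

    def __init__(self, groups, blocks_per_group):
        self.blocks_per_group = blocks_per_group
        # True means "in use"; every block starts allocated in this model.
        self.bitmaps = [[True] * blocks_per_group for _ in range(groups)]
        self.bg_free_blocks_count = [0] * groups
        self.s_free_blocks_count = 0
        self.s_dirt = False
        self.dirty_bitmaps = set()        # groups whose bitmap buffer is dirty
        self.s_lock = threading.Lock()    # plays the role of the s_lock semaphore

    def free_blocks(self, block, count):
        with self.s_lock:                 # down( ) on entry, up( ) on exit
            for b in range(block, block + count):
                group, bit = divmod(b, self.blocks_per_group)
                bitmap = self.bitmaps[group]              # step 1
                if not bitmap[bit]:
                    raise ValueError(f"block {b} is already free")
                bitmap[bit] = False                       # step 2
                self.dirty_bitmaps.add(group)
                self.bg_free_blocks_count[group] += 1     # step 3
                self.s_free_blocks_count += 1             # step 4
                self.s_dirt = True
```

For instance, releasing four blocks that straddle two block groups updates both group counters as well as the superblock counter.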


17.7 The Ext3 Filesystem

In this section we'll briefly describe the enhanced filesystem that has evolved from Ext2, named Ext3. The new filesystem has been designed with two simple concepts in mind:

● To be a journaling filesystem (see the next section)
● To be, as much as possible, compatible with the old Ext2 filesystem

Ext3 achieves both goals very well. In particular, it is largely based on Ext2, so its data structures on disk are essentially identical to those of an Ext2 filesystem. As a matter of fact, if an Ext3 filesystem has been cleanly unmounted, it can be remounted as an Ext2 filesystem; conversely, creating a journal of an Ext2 filesystem and remounting it as an Ext3 filesystem is a simple, fast operation.

Thanks to the compatibility between Ext3 and Ext2, most descriptions in the previous sections of this chapter apply to Ext3 as well. Therefore, in this section, we focus on the new feature offered by Ext3: the journal.

17.7.1 Journaling Filesystems

As disks became larger, one design choice of traditional Unix filesystems (like Ext2) turned out to be inappropriate. As we know from Chapter 14, updates to filesystem blocks might be kept in dynamic memory for a long period of time before being flushed to disk. A dramatic event like a power-down failure or a system crash might thus leave the filesystem in an inconsistent state. To overcome this problem, each traditional Unix filesystem is checked before being mounted; if it has not been properly unmounted, then a specific program executes an exhaustive, time-consuming check and fixes all the filesystem's data structures on disk.

For instance, the Ext2 filesystem status is stored in the s_mount_state field of the superblock on disk. The e2fsck utility program is invoked by the boot script to check the value stored in this field; if it is not equal to EXT2_VALID_FS, the filesystem was not properly unmounted, and therefore e2fsck starts checking all disk data structures of the filesystem.

Clearly, the time spent checking the consistency of a filesystem depends mainly on the number of files and directories to be examined; therefore, it also depends on the disk size. Nowadays, with filesystems reaching hundreds of gigabytes, a single consistency check may take hours. The involved downtime is unacceptable for any production environment or high-availability server.

The goal of a journaling filesystem is to avoid running time-consuming consistency checks on the whole filesystem by looking instead in a special disk area, named the journal, that contains the most recent disk write operations. Remounting a journaling filesystem after a system failure is a matter of a few seconds.

17.7.2 The Ext3 Journaling Filesystem

The idea behind Ext3 journaling is to perform any high-level change to the filesystem in two steps. First, a copy of the blocks to be written is stored in the journal; then, when the I/O data transfer to the journal is completed (in short, data is committed to the journal), the blocks are written in the filesystem. When the I/O data transfer to the filesystem terminates (data is committed to the filesystem), the copies of the blocks in the journal are discarded.

While recovering after a system failure, the e2fsck program distinguishes the following two cases:

● The system failure occurred before a commit to the journal. Either the copies of the blocks relative to the high-level change are missing from the journal or they are incomplete; in both cases, e2fsck ignores them.
● The system failure occurred after a commit to the journal. The copies of the blocks are valid and e2fsck writes them into the filesystem.
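The two-step write and the two recovery cases can be modeled with a toy journal. The Python sketch below is purely illustrative: the dictionaries standing in for the disk and the journal, the function names `write_with_journal` and `recover`, and the `crash_at` argument used to simulate a failure are assumptions of this example and do not correspond to Ext3 or e2fsck code.

```python
def write_with_journal(disk, journal, updates, crash_at=None):
    """Apply updates (dict: block number -> data) in two steps.

    crash_at may be None, "before_commit", or "after_commit" to simulate
    a system failure at that point.
    """
    # Step 1: copy the blocks into the journal, then commit the journal.
    for block, data in updates.items():
        journal["records"].append((block, data))
        if crash_at == "before_commit":
            return                       # failure: journal copies are incomplete
    journal["committed"] = True
    if crash_at == "after_commit":
        return                           # failure: filesystem not yet updated
    # Step 2: write the blocks into the filesystem, then discard the copies.
    for block, data in journal["records"]:
        disk[block] = data
    journal["records"].clear()
    journal["committed"] = False


def recover(disk, journal):
    """Replay a committed change; ignore an uncommitted one."""
    if journal["committed"]:
        for block, data in journal["records"]:
            disk[block] = data           # apply the whole high-level change
    journal["records"].clear()           # uncommitted copies are simply dropped
    journal["committed"] = False
```

In both failure scenarios the disk ends up holding either all of the new blocks or none of them, which is the consistency property the journal is meant to provide.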

In the first case, the high-level change to the filesystem is lost, but the filesystem state is still consistent. In the second case, e2fsck applies the whole high-level change, thus fixing any inconsistency due to unfinished I/O data transfers into the filesystem.

Don't expect too much from a journaling filesystem; it ensures consistency only at the system call level. For instance, a system failure that occurs while you are copying a large file by issuing several write( ) system calls will interrupt the copy operation, thus the duplicated file will be shorter than the original one.

Furthermore, journaling filesystems do not usually copy all blocks into the journal. In fact, each filesystem consists of two kinds of blocks: those containing the so-called metadata and those containing regular data. In the case of Ext2 and Ext3, there are six kinds of metadata: superblocks, group block descriptors, inodes, blocks used for indirect addressing (indirection blocks), data bitmap blocks, and inode bitmap blocks. Other filesystems may use different metadata.

Most journaling filesystems, like ReiserFS, SGI's XFS, and IBM's JFS, limit themselves to logging the operations affecting metadata. In fact, metadata's log records are sufficient to restore the consistency of the on-disk filesystem data structures. However, since operations on blocks of file data are not logged, nothing prevents a system failure from corrupting the contents of the files.

The Ext3 filesystem, however, can be configured to log the operations affecting both the filesystem metadata and the data blocks of the files. Since logging every kind of write operation leads to a significant performance penalty, Ext3 lets the system administrator decide what has to be logged; in particular, it offers three different journaling modes:

Journal

All filesystem data and metadata changes are logged into the journal. This mode minimizes the chance of losing the updates made to each file, but it requires many additional disk accesses. For example, when a new file is created, all its data blocks must be duplicated as log records. This is the safest and slowest Ext3 journaling mode.

Ordered

Only changes to filesystem metadata are logged into the journal. However, the Ext3 filesystem groups metadata and relative data blocks so that data blocks are written to disk before the metadata. This way, the chance of data corruption inside the files is reduced; for instance, any write access that enlarges a file is guaranteed to be fully protected by the journal. This is the default Ext3 journaling mode.

Writeback

Only changes to filesystem metadata are logged; this is the method found on the other journaling filesystems and is the fastest mode.

The journaling mode of the Ext3 filesystem is specified by an option of the mount system command. For instance, to mount an Ext3 filesystem stored in the /dev/sda2 partition on the /jdisk mount point with the "writeback" mode, the system administrator can type the command:

# mount -t ext3 -o data=writeback /dev/sda2 /jdisk

17.7.3 The Journaling Block Device Layer

The Ext3 journal is usually stored in a hidden file named .journal located in the root directory of the filesystem.

The Ext3 filesystem does not handle the journal on its own; rather, it uses a general kernel layer named Journaling Block Device, or JBD. Right now, only Ext3 uses the JBD layer, but other filesystems might use it in the future.

The JBD layer is a rather complex piece of software. The Ext3 filesystem invokes the JBD routines to ensure that its subsequent operations don't corrupt the disk data structures in case of system failure. However, JBD typically uses the same disk to log the changes performed by the Ext3 filesystem, and it is therefore vulnerable to system failures as much as Ext3. In other words, JBD must also protect itself from any system failure that could corrupt the journal.

Therefore, the interaction between Ext3 and JBD is essentially based on three fundamental units:

Log record

Describes a single update of a disk block of the journaling filesystem.

Atomic operation handle

Includes log records relative to a single high-level change of the filesystem; typically, each system call modifying the filesystem gives rise to a single atomic operation handle.

Transaction

Includes several atomic operation handles whose log records are marked valid for e2fsck at the same time.

17.7.3.1 Log records

A log record is essentially the description of a low-level operation that is going to be issued by the filesystem. In some journaling filesystems, the log record consists of exactly the span of bytes modified by the operation, together with the starting position of the bytes inside the filesystem. The JBD layer, however, uses log records consisting of the whole buffer modified by the low-level operation. This approach may waste a lot of journal space (for instance, when the low-level operation just changes the value of a bit in a bitmap), but it is also much faster because the JBD layer can work directly with buffers and their buffer heads.

Log records are thus represented inside the journal as normal blocks of data (or metadata). Each such block, however, is associated with a small tag of type journal_block_tag_t, which stores the logical block number of the block inside the filesystem and a few status flags.

Later, whenever a buffer is being considered by the JBD, either because it belongs to a log record or because it is a data block that should be flushed to disk before the corresponding metadata block (in the "ordered" journaling mode), the kernel attaches a journal_head data structure to the buffer head. In this case, the b_private field of the buffer head stores the address of the journal_head data structure and the BH_JBD flag is set (see Section 13.4.4).

17.7.3.2 Atomic operation handles

Any system call modifying the filesystem is usually split into a series of low-level operations that manipulate disk data structures.

For instance, suppose that Ext3 must satisfy a user request to append a block of data to a regular file. The filesystem layer must determine the last block of the file, locate a free block in the filesystem, update the data block bitmap inside the proper block group, store the logical number of the new block either in the file's inode or in an indirect addressing block, write the contents of the new block, and finally, update several fields of the inode. As you see, the append operation translates into many lower-level operations on the data and metadata blocks of the filesystem.

Now, just imagine what could happen if a system failure occurred in the middle of an append operation, when some of the lower-level manipulations have already been executed while others have not. Of course, the scenario could be even worse, with high-level operations affecting two or more files (for example, moving a file from one directory to another).

To prevent data corruption, the Ext3 filesystem must ensure that each system call is handled in an atomic way. An atomic operation handle is a set of low-level operations on the disk data structures that correspond to a single high-level operation. When recovering from a system failure, the filesystem ensures that either the whole high-level operation is applied or none of its low-level operations is.

Any atomic operation handle is represented by a descriptor of type handle_t. To start an atomic operation, the Ext3 filesystem invokes the journal_start( ) JBD function, which allocates, if necessary, a new atomic operation handle and inserts it into the current transaction (see the next section). Since any low-level operation on the disk might suspend the process, the address of the active handle is stored in the journal_info field of the process descriptor. To notify that an atomic operation is completed, the Ext3 filesystem invokes the journal_stop( ) function.
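The "allocates, if necessary" behavior of journal_start( ) can be pictured with a small model. In the Python sketch below, the classes `Task`, `Transaction`, and `Handle`, the `refs` nesting counter, and the Python functions that borrow the names journal_start and journal_stop are all hypothetical stand-ins written for this illustration; the real JBD functions have different signatures, which are not shown in the text.

```python
class Transaction:
    def __init__(self):
        self.handles = []            # handles inserted into this transaction


class Handle:
    def __init__(self, transaction):
        self.transaction = transaction
        self.refs = 0                # hypothetical nesting counter


class Task:
    def __init__(self):
        self.journal_info = None     # mirrors the process descriptor field


def journal_start(task, transaction):
    """Return the task's active handle, creating it only when necessary."""
    if task.journal_info is None:
        handle = Handle(transaction)
        transaction.handles.append(handle)
        task.journal_info = handle   # remembered across possible suspensions
    task.journal_info.refs += 1
    return task.journal_info


def journal_stop(task):
    """Signal that one level of the atomic operation has completed."""
    handle = task.journal_info
    handle.refs -= 1
    if handle.refs == 0:
        task.journal_info = None     # the atomic operation is finished
```

In this model, nested calls made on behalf of the same task reuse one handle, so all the low-level operations of a system call end up grouped under a single atomic operation.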

17.7.3.3 Transactions

For reasons of efficiency, the JBD layer manages the journal by grouping the log records that belong to several atomic operation handles into a single transaction. Furthermore, all log records relative to a handle must be included in the same transaction.

All log records of a transaction are stored in consecutive blocks of the journal. The JBD layer handles each transaction as a whole. For instance, it reclaims the blocks used by a transaction only after all data included in its log records is committed to the filesystem.

As soon as it is created, a transaction may accept log records of new handles. The transaction stops accepting new handles when either of the following occurs:

● A fixed amount of time has elapsed, typically 5 seconds.
● There are no free blocks in the journal left for a new handle.

A transaction is represented by a descriptor of type transaction_t. The most important field is t_state, which describes the current status of the transaction.

Essentially, a transaction can be:

Complete

All log records included in the transaction have been physically written onto the journal. When recovering from a system failure, e2fsck considers every complete transaction of the journal and writes the corresponding blocks into the filesystem. In this case, the t_state field stores the value T_FINISHED.

Incomplete

At least one log record included in the transaction has not yet been physically written to the journal, or new log records are still being added to the transaction. In case of system failure, the image of the transaction stored in the journal is likely not up to date. Therefore, when recovering from a system failure, e2fsck does not trust the incomplete transactions in the journal and skips them. In this case, the t_state field stores one of the following values:

T_RUNNING

Still accepting new atomic operation handles.

T_LOCKED

Not accepting new atomic operation handles, but some of them are still unfinished.

T_FLUSH

All atomic operation handles have finished, but some log records are still being written to the journal.

T_COMMIT

All log records of the atomic operation handles have been written to disk, and the transaction is in the process of being marked as completed on the journal.

At any given instant, the journal may include several transactions. Just one of them is in the T_RUNNING state: it is the active transaction that is accepting the new atomic operation handle requests issued by the Ext3 filesystem.

Several transactions in the journal might be incomplete because the buffers containing the relative log records have not yet been written to the journal.

A complete transaction is deleted from the journal only when the JBD layer verifies that all buffers described by the log records have been successfully written onto the Ext3 filesystem. Therefore, the journal can include at most one active transaction, alongside other incomplete transactions and several complete transactions. The log records of a complete transaction have been written to the journal but some of the corresponding buffers have yet to be written onto the filesystem.
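The life cycle of a transaction and the way recovery treats each state can be summarized in a short model. The Python sketch below is an illustration only: the numeric values of the enumeration are arbitrary, the linear ordering of the states is inferred from their descriptions above rather than taken from the JBD source, and the helpers `advance`, `is_complete`, and `replay_candidates` are hypothetical names.

```python
from enum import Enum


class TState(Enum):
    # Names follow the t_state values listed in the text; the numbers
    # are arbitrary and chosen only for this sketch.
    T_RUNNING = 1
    T_LOCKED = 2
    T_FLUSH = 3
    T_COMMIT = 4
    T_FINISHED = 5


# Assumed progression, inferred from the state descriptions.
NEXT_STATE = {
    TState.T_RUNNING: TState.T_LOCKED,
    TState.T_LOCKED: TState.T_FLUSH,
    TState.T_FLUSH: TState.T_COMMIT,
    TState.T_COMMIT: TState.T_FINISHED,
}


def advance(state):
    """Move a transaction to its next state; T_FINISHED is terminal."""
    if state is TState.T_FINISHED:
        raise ValueError("a finished transaction cannot advance")
    return NEXT_STATE[state]


def is_complete(state):
    """Only T_FINISHED transactions count as complete."""
    return state is TState.T_FINISHED


def replay_candidates(journal):
    """Return the names of the transactions that recovery would replay.

    journal: list of (name, TState) pairs (hypothetical representation).
    """
    return [name for name, state in journal if is_complete(state)]
```

Every state other than T_FINISHED is treated as incomplete, so a recovery pass over this model skips those transactions.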

17.7.4 How Journaling Works Le t 's t ry t o e xp la in h o w jo u rn a lin g wo rks wit h a n e xa m p le : t h e Ext 3 file s ys t e m la ye r re ce ive s a re q u e s t t o writ e s o m e d a t a b lo cks o f a re g u la r file . As yo u m ig h t e a s ily g u e s s , we a re n o t g o in g t o d e s crib e in d e t a il e ve ry s in g le o p e ra t io n o f t h e Ext 3 file s ys t e m la ye r a n d o f t h e JBD la ye r. Th e re wo u ld b e fa r t o o m a n y is s u e s t o b e co ve re d ! Ho we ve r, we d e s crib e t h e e s s e n t ia l a ct io n s : 1 . Th e s e rvice ro u t in e o f t h e write( ) s ys t e m ca ll t rig g e rs t h e write m e t h o d o f t h e file o b je ct a s s o cia t e d wit h t h e Ext 3 re g u la r file . Fo r Ext 3 , t h is m e t h o d is im p le m e n t e d b y t h e generic_file_write( ) fu n ct io n , a lre a d y d e s crib e d in S e ct io n 1 5 . 1 . 3 .

2. The generic_file_write() function invokes the prepare_write method of the address_space object several times, once for every page of data involved in the write operation. For Ext3, this method is implemented by the ext3_prepare_write() function.

3. The ext3_prepare_write() function starts a new atomic operation by invoking the journal_start() JBD function. The handle is added to the active transaction. Actually, the atomic operation handle is created only when executing the first invocation of the journal_start() function. Following invocations verify that the journal_info field of the process descriptor is already set and use the referenced handle.

4. The ext3_prepare_write() function invokes the block_prepare_write() function already described in Chapter 15, passing to it the address of the ext3_get_block() function. Remember that block_prepare_write() takes care of preparing the buffers and the buffer heads of the file's page.

5. When the kernel must determine the logical number of a block of the Ext3 filesystem, it executes the ext3_get_block() function. This function is actually similar to ext2_get_block(), which is described in the earlier Section 17.6.5. A crucial difference, however, is that the Ext3 filesystem invokes functions of the JBD layer to ensure that the low-level operations are logged:



   ● Before issuing a low-level write operation on a metadata block of the filesystem, the function invokes journal_get_write_access(). Basically, this latter function adds the metadata buffer to a list of the active transaction. However, it must also check whether the metadata is included in an older incomplete transaction of the journal; in this case, it duplicates the buffer to make sure that the older transactions are committed with the old content.

   ● After updating the buffer containing the metadata block, the Ext3 filesystem invokes journal_dirty_metadata() to move the metadata buffer to the proper dirty list of the active transaction and to log the operation in the journal.

   Notice that metadata buffers handled by the JBD layer are not usually included in the dirty lists of buffers of the inode, so they are not written to disk by the normal disk cache flushing mechanisms described in Chapter 14.

6. If the Ext3 filesystem has been mounted in "journal" mode, the ext3_prepare_write() function also invokes journal_get_write_access() on every buffer touched by the write operation.

7. Control returns to the generic_file_write() function, which updates the page with the data stored in the User Mode address space and then invokes the commit_write method of the address_space object. For Ext3, this method is implemented by the ext3_commit_write() function.

8. If the Ext3 filesystem has been mounted in "journal" mode, the ext3_commit_write() function invokes journal_dirty_metadata() on every buffer of data (not metadata) in the page. This way, the buffer is included in the proper dirty list of the active transaction and not in the dirty list of the owner inode; moreover, the corresponding log records are written to the journal.

9. If the Ext3 filesystem has been mounted in "ordered" mode, the ext3_commit_write() function invokes the journal_dirty_data() function on every buffer of data in the page to insert the buffer in a proper list of the active transaction. The JBD layer ensures that all buffers in this list are written to disk before the metadata buffers of the transaction. No log record is written onto the journal.

10. If the Ext3 filesystem has been mounted in "ordered" or "writeback" mode, the ext3_commit_write() function executes the normal generic_commit_write() function described in Chapter 15, which inserts the data buffers in the list of the dirty buffers of the owner inode.

11. Finally, ext3_commit_write() invokes journal_stop() to notify the JBD layer that the atomic operation handle is closed.

12. The service routine of the write() system call terminates here. However, the JBD layer has not finished its work. Eventually, our transaction becomes complete when all its log records have been physically written to the journal. Then journal_commit_transaction() is executed.

13. If the Ext3 filesystem has been mounted in "ordered" mode, the journal_commit_transaction() function activates the I/O data transfers for all data buffers included in the list of the transaction and waits until all data transfers terminate.

14. The journal_commit_transaction() function activates the I/O data transfers for all metadata buffers included in the transaction (and also for all data buffers, if Ext3 was mounted in "journal" mode).

15. Periodically, the kernel activates a checkpoint activity for every complete transaction in the journal. The checkpoint basically involves verifying whether the I/O data transfers triggered by journal_commit_transaction() have successfully terminated. If so, the transaction can be deleted from the journal.

Of course, the log records in the journal never play an active role until a system failure occurs. Only in this case, in fact, does the e2fsck utility program scan the journal stored in the filesystem and reschedule all write operations described by the log records of the complete transactions.
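The mode-dependent part of the walk-through (steps 8 to 11) can be summarized with a small model. The following Python sketch is not kernel code: the commit_write_calls() helper, its arguments, and the tuple representation are our own illustrative assumptions. It merely records, in order, the names of the functions that the steps above say ext3_commit_write() invokes for the data buffers of one page.

```python
JOURNAL, ORDERED, WRITEBACK = "journal", "ordered", "writeback"


def commit_write_calls(mode, data_buffers):
    """Return the ordered (function name, target) pairs for one page.

    Hypothetical helper: it only mirrors steps 8 to 11 of the text.
    """
    if mode not in (JOURNAL, ORDERED, WRITEBACK):
        raise ValueError(f"unknown Ext3 journaling mode: {mode!r}")

    calls = []
    if mode == JOURNAL:
        # Step 8: data buffers are logged in the journal like metadata.
        calls += [("journal_dirty_metadata", buf) for buf in data_buffers]
    elif mode == ORDERED:
        # Step 9: data buffers join a list written before the metadata.
        calls += [("journal_dirty_data", buf) for buf in data_buffers]
    if mode in (ORDERED, WRITEBACK):
        # Step 10: data buffers go to the dirty list of the owner inode.
        calls.append(("generic_commit_write", "page"))
    # Step 11: the atomic operation handle is closed in every mode.
    calls.append(("journal_stop", "handle"))
    return calls
```

Calling commit_write_calls("writeback", ["b0"]) in this model yields only the generic_commit_write and journal_stop entries, which reflects the fact that "writeback" mode does not hand data buffers to the JBD layer at all.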


Chapter 18. Networking

The Linux kernel supports many different network architectures (TCP/IP being just one of them), implements several alternative algorithms for scheduling the network packets, and includes programs that make it easy for system administrators to set up routers, gateways, firewalls, and even a simple World Wide Web server, directly at the kernel level.

The current code, inspired by the original Berkeley Unix implementation, is referred to as Net-4. As the name suggests, it is the fourth major version of Linux networking. Similar to VFS, the code uses objects to provide a common interface to the large number of available architectures. However, contrary to VFS, the networking code is organized into layers, each of which has a well-defined interface with the adjacent layers. Since data transmitted along the network is not reusable, there is no need to cache it. For the sake of efficiency, Linux avoids copying the data across layers; the original data is stored in a memory buffer, which is large enough to contain the control information requested by each layer.

Packing a detailed description of the Linux networking code in a single chapter of a book would be a truly impossible mission. In fact, nearly 20 percent of all kernel source code is devoted to networking. Therefore, we couldn't even succeed, within the space constraints of a single chapter, in mentioning the names of all the features, components, and data structures of the Linux network subsystem. Our objective is more limited. We concentrate on the well-known TCP/IP stack of protocols and consider only the data link layer, the network layer, and the transport layer. Furthermore, for the sake of simplicity, we focus our attention on the UDP protocol and attempt to give a succinct description of how the kernel succeeds in sending or receiving a single datagram. Finally, we assume that our computer is connected to a local area network by means of an Ethernet card.

The first section of the chapter covers the main data structures used by Linux networking, while the second one illustrates the system calls needed to send or receive a single datagram and briefly describes the corresponding service routines. The last two sections describe how the kernel interacts with the network card to send or receive a packet.

We assume that you already have some background in network protocols, layers, and applications. There are many good books on these topics, some of which are listed in the Bibliography at the end of this book.

One final remark: writing programs for the network subsystem is quite a hard task. While you have to stick to the documented standards, following them is not enough because they do not specify the smallest, most cumbersome details of the protocols. Thus you have to take into account the implementations of the already existing network programs, even those in other operating systems (bugs included). And, of course, you must write fast and efficient programs; otherwise your server will not keep up with the highest network loads.


18.1 Main Networking Data Structures

In this section, we shall give a general idea of how Linux implements the lower layers of networking.

18.1.1 Network Architectures

A network architecture describes how a specific computer network is made. The architecture defines a set of layers, each of which should have a well-defined purpose; programs in each layer communicate by using a shared set of rules and conventions (a so-called protocol). Generally speaking, Linux supports a large number of different network architectures; some of them are listed in Table 18-1.

Table 18-1. Some network architectures supported by Linux

Name                  Network architecture and/or protocol family
PF_APPLETALK          Appletalk
PF_BLUETOOTH          Bluetooth
PF_BRIDGE             Multiprotocol bridge
PF_DECnet             DECnet
PF_INET               IPS's IPv4 protocol
PF_INET6              IPS's IPv6 protocol
PF_IPX                Novell IPX
PF_LOCAL, PF_UNIX     Unix domain sockets (local communication)
PF_PACKET             IPS's IPv4/IPv6 protocol low-level access
PF_X25                X25

IPS (Internet Protocol Suite) is the network architecture of the Internet, the well-known internetwork that collects hundreds of thousands of local computer networks all around the world. Sometimes it is also called the TCP/IP network architecture, from the names of the two main protocols that it defines.

18.1.2 Network Interface Cards

A network interface card (NIC) is a special kind of I/O device that does not have a corresponding device file. Essentially, a network card places outgoing data on a line going to remote computer systems and receives packets from those systems into kernel memory.

Starting with BSD, all Unix systems assign a different symbolic name to each network card included in the computer; for instance, the first Ethernet card gets the eth0 name. However, the name does not correspond to any device file and has no corresponding inode in the system directory tree. Instead of using the filesystem, the system administrator has to set up a relationship between the device name and a network address. Therefore, as we shall see in the later Section 18.2, BSD Unix introduced a new group of system calls, which has become the standard programming model for network devices.

18.1.3 BSD Sockets

Generally speaking, any operating system must define a common Application Programming Interface (API) between the User Mode program and the networking code. The Linux networking API is based on BSD sockets. They were introduced in Berkeley's Unix 4.1cBSD and are available in almost all Unix-like operating systems, either natively or by means of a User Mode helper library. [1]

[1] An alternative API between User Mode programs and networking code is provided by the Transport Layer Interface (TLI), introduced by System V Release 3.0. In general, TLI is implemented as a User Mode library that uses the STREAMS I/O subsystem. As mentioned in Section 1.1, the Linux kernel does not implement the STREAMS I/O subsystem.

A socket is a communication endpoint: the terminal gate of a channel that links two processes. Data is pushed on a terminal gate and, after some delay, shows up at the other gate. The communicating processes may be on different computers; it's up to the kernel's networking code to forward the data between the two endpoints.

Linux implements BSD sockets as files that belong to the sockfs special filesystem (see Section 12.3.1). More precisely, for every new BSD socket, the kernel creates a new inode in the sockfs special filesystem. The attributes of the BSD socket are stored in a socket data structure, which is an object included in the filesystem-specific u.socket_i field of the sockfs's inode.
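The "two gates of one channel" picture can be observed from User Mode. The following Python sketch is only an illustration under our own assumptions (the echo_through_socketpair() helper is hypothetical, and both endpoints live in the same process for simplicity); it relies on the standard socket.socketpair() call, which returns two already-connected sockets.

```python
import socket


def echo_through_socketpair(payload: bytes) -> bytes:
    """Push payload into one endpoint and read it back from the other."""
    # socketpair() returns two connected endpoints of a single channel.
    left, right = socket.socketpair()
    try:
        left.sendall(payload)  # data is pushed on one gate...
        received = b""
        while len(received) < len(payload):
            chunk = right.recv(len(payload) - len(received))
            if not chunk:  # the peer closed the channel early
                break
            received += chunk
        return received  # ...and shows up at the other gate
    finally:
        left.close()
        right.close()
```

Because nobody reads while sendall() is running, the sketch is meant for small payloads that fit in the socket buffer. Each endpoint also has a file descriptor (left.fileno()), which is consistent with sockets being implemented as files.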

The most important fields of the BSD socket object are:

inode
    Points to the sockfs's inode object.

file
    Points to the file object of the sockfs's file.

state
    Stores the connection status of the socket: SS_FREE (not allocated), SS_UNCONNECTED (not yet connected), SS_CONNECTING (in process of connecting), SS_CONNECTED (connected), SS_DISCONNECTING (in process of disconnecting).

ops
    Points to a proto_ops data structure, which stores the methods of the socket object; they are listed in Table 18-2. Most of the methods refer to system calls that operate on sockets. Each network architecture implements the methods by means of its own functions; hence, the same system call acts differently according to the networking architecture to which the target socket belongs.

Table 18-2. The methods of the BSD socket object

Method        Description
release       Close the socket
bind          Assign a local address (a name)
connect       Either establish a connection (TCP) or assign a remote address (UDP)
socketpair    Create a pair of sockets for two-way data flow
accept        Wait for connection requests
getname       Return the local address
ioctl         Implement ioctl()'s commands
listen        Initialize the socket to accept connection requests
shutdown      Close a half or both halves of a full-duplex connection
setsockopt    Set the value of the socket flags
getsockopt    Get the value of the socket flags
sendmsg       Send a packet on the socket
recvmsg       Receive a packet from the socket
mmap          File memory-mapping (not used by network sockets)
sendpage      Copy data directly from/to a file (sendfile() system call)

sk
    Points to the low-level struct sock socket descriptor (see the next section).

18.1.4 INET Sockets

INET sockets are data structures of type struct sock. Any BSD socket that belongs to the IPS network architecture stores the address of an INET socket in the sk field of the socket object.

INET sockets are required because the socket objects (describing BSD sockets) include only fields that are meaningful to all network architectures. However, the kernel must also keep track of several other bits of information for any socket of every specific network architecture. For instance, in each INET socket, the kernel records the local and remote IP addresses, the local and remote port numbers, the relative transport protocol, the queue of packets that were received from the socket, the queue of packets waiting to be sent to the socket, and several tables of methods that handle the packets traveling on the socket. These attributes are stored, together with many others, in the INET socket.

The INET socket object also defines some methods specific to the type of transport protocol adopted (TCP or UDP). The methods are stored in a data structure of type proto and are listed in Table 18-3.

Table 18-3. The methods of the INET socket object

Method        Description
close         Close the socket
connect       Either establish a connection or assign a remote address
disconnect    Relinquish an established connection
accept        Wait for connection request
ioctl         Implement ioctl()'s commands
init          INET socket object constructor
destroy       INET socket object destructor
shutdown      Close a half or both halves of a full-duplex connection
setsockopt    Set the value of the socket flags
getsockopt    Get the value of the socket flags
sendmsg       Send a packet on the socket
recvmsg       Receive a packet from the socket
bind          Assign a local address (a name)
backlog_rcv   Callback function invoked when receiving a packet
hash          Add the INET socket to the per-protocol hash table
unhash        Remove the INET socket from the per-protocol hash table
get_port      Assign a port number to the INET socket

As you may notice, many methods replicate the methods of the BSD socket object (Table 18-2). Actually, a BSD socket method usually invokes the corresponding INET socket method, if it is defined.

The sock object includes no fewer than 80 fields; many of them are pointers to other objects, tables of methods, or other data structures that deserve a detailed description by themselves. Rather than including a boring list of field names, we introduce a few fields of the sock object whenever we encounter them in the rest of the chapter.
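The delegation from a BSD socket method to the protocol-specific INET socket method can be pictured with a toy model. The Python sketch below is not the kernel's implementation: the Proto, Sock, and BsdSocket classes, the call() helper, and the returned strings are all hypothetical names introduced for illustration. The two connect behaviors follow the connect rows of Table 18-2 and Table 18-3.

```python
class Proto:
    """Stand-in for a transport-specific method table (the proto structure)."""

    def __init__(self, name, **methods):
        self.name = name
        self.methods = methods  # method name -> callable


class Sock:
    """Stand-in for the INET socket (struct sock)."""

    def __init__(self, proto):
        self.proto = proto


class BsdSocket:
    """Stand-in for the BSD socket object; sk plays the role of the sk field."""

    def __init__(self, sk):
        self.sk = sk

    def call(self, method, *args):
        # Forward to the INET socket method only if the protocol defines it.
        handler = self.sk.proto.methods.get(method)
        if handler is None:
            return None  # assumption: this sketch models no fallback behavior
        return handler(*args)


# Hypothetical per-protocol behaviors for the connect method.
udp = Proto("UDP", connect=lambda addr: f"UDP: remote address {addr} assigned")
tcp = Proto("TCP", connect=lambda addr: f"TCP: connection to {addr} established")
```

In this model the same call("connect", ...) request produces different results depending on the table referenced by the target socket, which is the point made about system calls acting differently according to the protocol in use.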

18.1.5 The Destination Cache

As we shall see in the later Section 18.2.2, processes usually "assign names" to sockets; that is, they specify the remote IP address and port number of the host that should receive the data written onto the socket. The kernel shall also make available to the processes reading the sockets every packet received from the remote host carrying the proper port number.

Actually, the kernel has to keep in memory a bunch of data about the remote host identified by an in-use socket. To speed up the networking code, this data is stored in a so-called destination cache, whose entries are objects of type dst_entry. Each INET socket stores in the dst_cache field a pointer to a single dst_entry object, which corresponds to the destination host bound to the socket.

A dst_entry object stores a lot of data used by the kernel whenever it sends a packet to the corresponding remote host. For instance, it includes:

   ● A pointer to a net_device object describing the network device (for instance, a network card) that transmits or receives the packets

   ● A pointer to a neighbour structure relative to the router that forwards the packets to their final destination, if any (see the later Section 18.1.6.3)

   ● A pointer to a hh_cache structure, which describes the common header to be attached to every packet to be transmitted (see the later Section 18.1.6.3)

   ● The pointer to a function invoked whenever a packet is received from the remote host

   ● The pointer to a function invoked whenever a packet is to be transmitted



18.1.6 Routing Data Structures

The most important function of the IP layer consists of ensuring that packets originated by the host or received by the network interface cards are forwarded toward their final destinations. As you might easily guess, this task is really crucial because the routing algorithm should be fast enough to keep up with the highest network loads.

The IP routing mechanism is fairly simple. Each 32-bit integer representing an IP address encodes both a network address, which specifies the network the host is in, and a host identifier, which specifies the host inside the network. To properly interpret the IP address, the kernel must know the network mask of a given IP address; that is, which bits of the IP address encode the network address. For instance, suppose the network mask of the IP address 192.160.80.110 is 255.255.255.0; then 192.160.80.0 represents the network address, while 110 identifies the host inside its network. Nowadays, the network address is almost always stored in the most significant bits of the IP address, so each network mask can also be represented by the number of bits set to 1 (24 in our example).
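The arithmetic of the example can be checked with a few lines of Python. This is a minimal sketch under our own assumptions: the split_address() helper is a hypothetical name, and the code uses the standard ipaddress module only to convert between dotted notation and 32-bit integers.

```python
import ipaddress


def split_address(ip: str, mask: str):
    """Return (network address, host identifier, number of mask bits set)."""
    ip_int = int(ipaddress.IPv4Address(ip))
    mask_int = int(ipaddress.IPv4Address(mask))
    # The network address is the logical AND of the address and the mask.
    network = ipaddress.IPv4Address(ip_int & mask_int)
    # The host identifier is what remains in the bits not covered by the mask.
    host_id = ip_int & ~mask_int & 0xFFFFFFFF
    # A contiguous mask can be summarized by its count of 1 bits.
    prefix_len = bin(mask_int).count("1")
    return str(network), host_id, prefix_len
```

For the address and mask used in the text, split_address("192.160.80.110", "255.255.255.0") returns the network address 192.160.80.0, the host identifier 110, and a mask length of 24 bits.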
The key property of IP routing is that any host in the internetwork needs only to know the address of a computer inside its local area network (a so-called router), which is able to forward the packets to the destination network. For instance, consider the following routing table shown by the netstat -rn system command:

Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
192.160.80.0    0.0.0.0         255.255.255.0   U        40 0          0 eth1
192.160.0.0     0.0.0.0         255.255.0.0     U        40 0          0 eth0
192.160.50.0    192.160.11.1    255.255.0.0     UG       40 0          0 eth0
0.0.0.0         192.160.1.1     0.0.0.0         UG       40 0          0 eth0

This computer is linked to two networks. One of them has the IP address 192.160.80.0 and a netmask of 24 bits, and it is served by the Network Interface Card (NIC) associated with the network device eth1. The other network has the IP address 192.160.0.0 and a netmask of 16 bits, and it is served by the NIC associated with eth0.

Suppose that a packet must be sent to a host that belongs to the local area network 192.160.80.0 and that has the IP address 192.160.80.110. The kernel examines the static routing table starting with the highest entry (the one including the greatest number of bits set to 1 in the netmask). For each entry, it performs a logical AND between the destination host's IP address and the netmask; if the result is equal to the network destination address, the kernel uses the entry to route the packet. In our case, the first entry wins and the packet is sent to the eth1 network device. In this case, the "gateway" field of the static routing table entry is null ("0.0.0.0"). This means the address is on the local network of the sender, so the computer sends packets directly to hosts in the network; it encapsulates the packet in a frame carrying the Ethernet address of the destination host. The frame is physically broadcast to all hosts in the network, but any NIC automatically ignores frames carrying Ethernet addresses different from its own.

Suppose now that a packet must be sent to a host that has the IP address 209.204.146.22. This address belongs to a remote network (not directly linked to our computer). The last entry in the table is a catch-all entry, since the AND logical operation with the netmask 0.0.0.0 always yields the network address 0.0.0.0. Thus, in our case, any IP address still not resolved by higher entries is sent through the eth0 network device to the default router that has the IP address 192.160.1.1, which hopefully knows how to forward the packet toward its final destination. The packet is encapsulated in a frame carrying the Ethernet address of the default router.
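The lookup just described can be sketched in a few lines of Python. This is an illustrative model, not the kernel's algorithm as implemented: the ROUTES list, the lookup() helper, and the idea of returning the IP address of the next hop (the host whose Ethernet address the frame would carry) are our own assumptions. The rows are copied from the netstat -rn listing above.

```python
import ipaddress

# (destination, gateway, genmask, interface), as printed by netstat -rn.
ROUTES = [
    ("192.160.80.0", "0.0.0.0", "255.255.255.0", "eth1"),
    ("192.160.0.0", "0.0.0.0", "255.255.0.0", "eth0"),
    ("192.160.50.0", "192.160.11.1", "255.255.0.0", "eth0"),
    ("0.0.0.0", "192.160.1.1", "0.0.0.0", "eth0"),
]


def _to_int(dotted):
    return int(ipaddress.IPv4Address(dotted))


def lookup(dest_ip, routes=ROUTES):
    """Return (interface, next hop) for dest_ip, or None if nothing matches."""
    dest = _to_int(dest_ip)
    # Examine first the entries with more bits set to 1 in the netmask.
    ordered = sorted(
        routes, key=lambda r: bin(_to_int(r[2])).count("1"), reverse=True
    )
    for network, gateway, mask, iface in ordered:
        if (dest & _to_int(mask)) == _to_int(network):
            # A null gateway means the destination is directly reachable.
            next_hop = dest_ip if gateway == "0.0.0.0" else gateway
            return iface, next_hop
    return None
```

With this table, lookup("192.160.80.110") selects eth1 with the destination itself as next hop, while lookup("209.204.146.22") falls through to the catch-all entry and selects eth0 with the default router 192.160.1.1.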

18.1.6.1 The Forwarding Information Base (FIB)

The Forwarding Information Base (FIB), or static routing table, is the ultimate reference used by the kernel to determine how to forward packets to their destinations. As a matter of fact, if the destination network of a packet is not included in the FIB, then the kernel cannot transmit that packet. As mentioned previously, however, the FIB usually includes a default entry that catches any IP address not resolved by the other entries.

The kernel data structures that implement the FIB are quite sophisticated. In fact, the routing table of a router might include several hundred lines, most of which refer to the same network devices or to the same gateway. Figure 18-1 illustrates a simplified view of the FIB's data structures when the table includes the four entries of the routing table just shown. You can get a low-level view of the data included in the FIB data structures by reading the /proc/net/route file.

Figure 18-1. FIB's main data structures

Th e main_table g lo b a l va ria b le p o in t s t o a n fib_table o b je ct t h a t re p re s e n t s t h e s t a t ic ro u t in g t a b le o f t h e IPS a rch it e ct u re . Act u a lly, it is p o s s ib le t o d e fin e s e co n d a ry ro u t in g t a b le s , b u t t h e t a b le re fe re n ce d b y main_table is t h e m o s t im p o rt a n t o n e . Th e fib_table o b je ct in clu d e s t h e a d d re s s e s o f s o m e m e t h o d s t h a t o p e ra t e o n t h e FIB, a n d s t o re s t h e p o in t e r t o a fn_hash d a t a s t ru ct u re .

The fn_hash data structure is essentially an array of 33 pointers, one for every FIB zone. A zone includes routing information for destination networks that have a given number of bits in the network mask. For instance, zone 24 includes entries for networks that have the mask 255.255.255.0.

Each zone is represented by a fn_zone descriptor. It references, through a hash table, the set of entries of the routing table that have the given netmask. For instance, in Figure 18-1, zone 16 references the entries 192.160.0.0 and 192.50.0.0.

The data relative to each routing table entry is stored in a fib_node descriptor. A router might have several entries, but it usually has very few network devices. Thus, to avoid wasting space, the fib_node descriptor does not include information about the network interface, but rather a pointer to a fib_info descriptor shared by several entries.
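To make the zone numbering concrete, the following sketch maps a netmask to its zone index (the number of leading 1 bits) and back. There are 33 zones because a mask can have anywhere from 0 to 32 bits set. The function names mask_to_zone( ) and zone_to_mask( ) are made up for this illustration and do not appear in the kernel.

```c
#include <stdint.h>

/* Number of FIB zones: one per possible mask length, 0 through 32. */
#define FIB_ZONES 33

/* Return the number of leading 1 bits in a netmask given in host byte
 * order, or -1 if the mask is not a run of 1s followed only by 0s. */
int mask_to_zone(uint32_t netmask)
{
    int bits = 0;

    while (netmask & 0x80000000u) {
        bits++;
        netmask <<= 1;
    }
    return netmask == 0 ? bits : -1;
}

/* Inverse mapping: the netmask (host byte order) for a zone index 0..32. */
uint32_t zone_to_mask(int zone)
{
    /* Shifting a 32-bit value by 32 is undefined in C, so zone 0 is
     * handled separately. */
    return zone == 0 ? 0 : 0xffffffffu << (32 - zone);
}
```

For example, the mask 255.255.255.0 (0xffffff00) has 24 leading 1 bits, so its entries belong to zone 24, while the catch-all mask 0.0.0.0 belongs to zone 0.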

18.1.6.2 The routing cache

Looking up a route in the static routing table is quite a slow task: the kernel has to walk the various zones in the FIB and, for each entry in a zone, check whether the logical AND between the host destination address and the entry's netmask yields the entry's exact network address. To speed up routing, the kernel keeps the most recently discovered routes in a routing cache. Typically, the cache includes several hundred entries; they are sorted so that more frequently used routes are retrieved more quickly. You can easily get the contents of the cache by reading the /proc/net/rt_cache file.

The main data structure of the routing cache is the rt_hash_table hash table; its hash function combines the destination host's IP address with other information, like the source address of the packet and the type of service required. In fact, the Linux networking code allows you to fine-tune the routing process so that a packet can, for instance, be routed along several paths according to where the packet came from and what kind of data it is carrying.

Each entry of the cache is represented by a rtable data structure, which stores several pieces of information; among them:

● The source and destination IP addresses
● The gateway IP address, if any
● Data relative to the route identified by the entry, stored in a dst_entry embedded in the rtable data structure (see the earlier Section 18.1.5)

18.1.6.3 The neighbor cache

Another core component of the networking code is the so-called "neighbor cache," which includes information relative to hosts that belong to the networks directly linked to the computer.

We know that IP addresses are the main host identifiers of the network layer; unfortunately, they are meaningless for the lower data link layer, whose protocols are essentially hardware-dependent. In practice, when the kernel has to transmit a packet by means of a given network card device, it must encapsulate the data in a frame carrying, among other things, the hardware-dependent identifiers of the source and destination network card devices.

Most local area networks are based on the IEEE 802 standards, and in particular, on the 802.3 standard, which is commercially known as "Ethernet." [2] The network card identifiers of the 802 standards are 48-bit numbers, which are usually written as 6 bytes separated by colons (such as "00:50:DA:61:A7:83"). No two network card devices share the same identifier (although it would be sufficient to ensure that all network card devices in the same local area network have different identifiers).

[2] Actually, Ethernet local area networks sprang up before IEEE published its standards; unfortunately, Ethernet and IEEE standards disagree in small but nevertheless crucial details (for instance, in the format of the data link packets). Every host in the Internet is able to operate with both standards, though.

How can the kernel know the identifier of a remote device? It uses an IPS protocol named Address Resolution Protocol (ARP). Basically, the kernel sends a broadcast packet into the local area network carrying the question: "What is the identifier of the network card device associated with IP address X?" As a result, the host identified by the specified IP address sends an answer packet carrying the network card device identifier.

It is a waste of time and bandwidth to repeat the whole process for every packet to be sent. Thus, the kernel keeps the network card device identifier, together with other precious data concerning the physical connection to the remote device, in the neighbor cache (often also called the arp cache). You can get the contents of this cache by reading the /proc/net/arp file. System administrators may also explicitly set the entries of this cache by means of the arp command.

Each entry of the neighbor cache is an object of type neighbour; the most important field is certainly ha, which stores the network card device identifier. The entry also stores a pointer to a hh_cache object belonging to the hardware header cache; since all packets sent to the same remote network card device are encapsulated in frames having the same header (essentially carrying the source and destination device identifiers), the kernel keeps a copy of the header in memory to avoid having to reconstruct it from scratch for every packet.
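As a small illustration of the 48-bit identifier format mentioned earlier, the sketch below converts the colon-separated notation into the six raw bytes that a frame header would carry. The parse_hw_addr( ) function and the HW_ADDR_LEN macro are hypothetical helpers written for this example; they are not part of the kernel's neighbor cache code.

```c
#include <stdint.h>
#include <stdio.h>

#define HW_ADDR_LEN 6   /* 48 bits */

/* Parse "xx:xx:xx:xx:xx:xx" into six bytes.
 * Returns 0 on success, -1 if the string is malformed. */
int parse_hw_addr(const char *text, uint8_t out[HW_ADDR_LEN])
{
    unsigned int b[HW_ADDR_LEN];
    char extra;

    /* The trailing %c detects garbage after the sixth byte: a well-formed
     * string produces exactly six conversions. */
    if (sscanf(text, "%2x:%2x:%2x:%2x:%2x:%2x%c",
               &b[0], &b[1], &b[2], &b[3], &b[4], &b[5], &extra) != 6)
        return -1;

    for (int i = 0; i < HW_ADDR_LEN; i++)
        out[i] = (uint8_t)b[i];
    return 0;
}
```

Parsing "00:50:DA:61:A7:83" this way yields the bytes 0x00, 0x50, 0xDA, 0x61, 0xA7, and 0x83.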

18.1.7 The Socket Buffer

Each single packet transmitted through a network device is composed of several pieces of information. Besides the payload (that is, the data whose transmission caused the creation of the packet itself), all network layers, starting from the data link layer and ending at the transport layer, add some control information. The format of a packet handled by a network card device is shown in Figure 18-2.

Figure 18-2. The packet format

The whole packet is built by different functions in several stages. For instance, the UDP/TCP header and the IP header are composed by functions belonging, respectively, to the transport layer and the network layer of the IPS architecture, while the hardware header and trailer, which build the frame encapsulating the IP datagram, are written by a suitable method specific to the network card device.

The Linux networking code keeps each packet in a large memory area called a socket buffer. Each socket buffer is associated with a descriptor, which is a data structure of type sk_buff that stores, among other things, pointers to the following data structures:



● The socket buffer
● The payload, that is, the user data (inside the socket buffer)
● The data link trailer (inside the socket buffer)
● The INET socket (sock object)
● The network device's net_device object
● A descriptor of the transport layer header
● A descriptor of the network layer header
● A descriptor of the data link layer header
● The destination cache entry (dst_entry object)

The sk_buff data structure includes many other fields, like an identifier of the network protocol used for transmitting the packet, a checksum field, and the arrival time for received packets.

As a general rule, the kernel avoids copying data, but simply passes the sk_buff descriptor pointer, and thus the socket buffer, to each networking layer in turn. For instance, when preparing a packet to send, the transport layer starts copying the payload from the User Mode buffer into the higher portion of the socket buffer; then the transport layer adds its TCP or UDP header before the payload. Next, the control is transferred to the network layer, which receives the socket buffer descriptor and adds the IP header before the transport header. Eventually, the data link layer adds its header and trailer, and enqueues the packet for transmission.
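The idea of placing the payload in the higher portion of the buffer and letting each layer prepend its header without copying the payload again can be modeled in User Mode with a very small structure. The following is a toy sketch under stated assumptions: the toy_skb type and its three functions are invented for this example and are far simpler than the kernel's sk_buff, and the "headers" are just short strings.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BUF_SIZE 256

/* Toy model of a socket buffer: used bytes occupy [data, tail) in area[]. */
struct toy_skb {
    unsigned char area[BUF_SIZE];
    size_t data;   /* offset of the first used byte */
    size_t tail;   /* offset just past the last used byte */
};

/* Start with an empty buffer, leaving `headroom` bytes free for headers. */
void toy_skb_init(struct toy_skb *skb, size_t headroom)
{
    assert(headroom <= BUF_SIZE);
    skb->data = skb->tail = headroom;
}

/* Append bytes after the current data (used for the payload). */
void toy_skb_put(struct toy_skb *skb, const void *src, size_t len)
{
    assert(len <= BUF_SIZE - skb->tail);
    memcpy(skb->area + skb->tail, src, len);
    skb->tail += len;
}

/* Prepend bytes before the current data (used by each layer's header). */
void toy_skb_push(struct toy_skb *skb, const void *src, size_t len)
{
    assert(len <= skb->data);
    skb->data -= len;
    memcpy(skb->area + skb->data, src, len);
}
```

A sender would call toy_skb_init( ) with enough headroom for all headers, store the payload with toy_skb_put( ), and then let the transport, network, and data link layers each call toy_skb_push( ) in turn. The payload is written once and never moved.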


18.2 System Calls Related to Networking

We won't be able to discuss all system calls related to networking. However, we shall examine the basic ones, namely those needed to send a UDP datagram.

In most Unix-like systems, the User Mode code fragment that sends a datagram looks like the following:

    int sockfd; /* socket descriptor */
    struct sockaddr_in addr_local, addr_remote; /* IPv4 address descriptors */
    const char mesg[] = "Hello, how are you?";

    sockfd = socket(PF_INET, SOCK_DGRAM, 0);

    addr_local.sin_family = AF_INET;
    addr_local.sin_port = htons(50000);
    addr_local.sin_addr.s_addr = htonl(0xc0a050f0); /* 192.160.80.240 */
    bind(sockfd, (struct sockaddr *) &addr_local,
         sizeof(struct sockaddr_in));

    addr_remote.sin_family = AF_INET;
    addr_remote.sin_port = htons(49152);
    inet_pton(AF_INET, "192.160.80.110", &addr_remote.sin_addr);
    connect(sockfd, (struct sockaddr *) &addr_remote,
            sizeof(struct sockaddr_in));

    write(sockfd, mesg, strlen(mesg)+1);

Obviously, this listing does not represent the complete source code of the program. For instance, we have not defined a main( ) function, we have omitted the proper #include directives for loading the header files, and we have not checked the return values of the system calls. However, the listing includes all network-related system calls issued by the program to send a UDP datagram.

Let's describe the system calls in the order the program uses them.
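For readers who want to compile the fragment, here is one possible way to wrap the same five calls in a function with the header files and return-value checks that the listing leaves out. It is a sketch, not the book's program: the fill_addr( ) and send_datagram( ) names are assumptions introduced here, and both addresses are converted with inet_pton( ) for uniformity.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Fill an IPv4 socket address from a dotted string and a host-order port.
 * Returns 0 on success, -1 if the string is not a valid IPv4 address. */
int fill_addr(struct sockaddr_in *addr, const char *ip, unsigned short port)
{
    memset(addr, 0, sizeof(*addr));
    addr->sin_family = AF_INET;
    addr->sin_port = htons(port);
    return inet_pton(AF_INET, ip, &addr->sin_addr) == 1 ? 0 : -1;
}

/* Send one UDP datagram holding `mesg`, including its terminating NUL.
 * Returns the number of bytes written, or -1 on any failure. */
ssize_t send_datagram(const char *local_ip, unsigned short local_port,
                      const char *remote_ip, unsigned short remote_port,
                      const char *mesg)
{
    struct sockaddr_in addr_local, addr_remote;
    ssize_t n = -1;
    int sockfd;

    if (fill_addr(&addr_local, local_ip, local_port) < 0 ||
        fill_addr(&addr_remote, remote_ip, remote_port) < 0)
        return -1;

    sockfd = socket(PF_INET, SOCK_DGRAM, 0);
    if (sockfd < 0)
        return -1;

    /* bind( ) sets the local endpoint, connect( ) the remote one. */
    if (bind(sockfd, (struct sockaddr *)&addr_local,
             sizeof(addr_local)) == 0 &&
        connect(sockfd, (struct sockaddr *)&addr_remote,
                sizeof(addr_remote)) == 0)
        n = write(sockfd, mesg, strlen(mesg) + 1);

    close(sockfd);
    return n;
}
```

Calling send_datagram("192.160.80.240", 50000, "192.160.80.110", 49152, "Hello, how are you?") mirrors the listing. The call can succeed only on a host that actually owns the local address, since bind( ) rejects addresses that do not belong to the machine.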

18.2.1 The socket( ) System Call

The socket( ) system call creates a new endpoint for a communication between two or more processes. In our example program, it is invoked in this way:

    sockfd = socket(PF_INET, SOCK_DGRAM, 0);

The socket( ) system call returns a file descriptor. In fact, a socket is similar to an opened file because it is possible to read and write data on it by means of the usual read( ) and write( ) system calls.

The first parameter of the socket( ) system call represents the network architecture that will be used for the communication, as well as a particular network layer protocol adopted by the network architecture. The PF_INET macro denotes both the IPS architecture and Version 4 of the IP protocol (IPv4). Linux supports several different network architectures; a few of them are shown in Table 18-1 earlier in this chapter.

The second parameter of the socket( ) system call specifies the basic model of communication inside the network architecture. As we already know, the IPS architecture offers essentially two alternative models of communication:

SOCK_STREAM
    Reliable, connection-oriented, stream-based communication implemented by the TCP transport protocol

SOCK_DGRAM
    Unreliable, connection-less, datagram-based communication implemented by the UDP transport protocol

Moreover, the special SOCK_RAW value creates a socket that can be used to directly access the network layer protocol (in our case, the IPv4 protocol).

In general, a network architecture might offer other models of communication. For instance, SOCK_SEQPACKET specifies a reliable, connection-oriented, datagram-based communication, while SOCK_RDM specifies a reliable, connection-less, datagram-based communication; however, neither of them is available in the IPS.

The third parameter of the socket( ) system call specifies the transport protocol to be used in the communication; in general, for any model of communication, the network architecture might offer several different protocols. Passing the value 0 selects the default protocol for the specified communication model. Of course, when using the IPS, the value 0 selects the TCP transport protocol (IPPROTO_TCP) for the SOCK_STREAM model and the UDP protocol (IPPROTO_UDP) for the SOCK_DGRAM model. On the other hand, the SOCK_RAW model allows the programmer to specify any one of the network-layer service protocols of the IPS, for instance, the Internet Control Message Protocol (IPPROTO_ICMP), the Exterior Gateway Protocol (IPPROTO_EGP), or the Internet Group Management Protocol (IPPROTO_IGMP).

The socket( ) system call is implemented by means of the sys_socket( ) service routine, which essentially performs three actions:

1. Allocates a descriptor for the new BSD socket (see the earlier Section 18.1.3).
2. Initializes the new descriptor according to the specified network architecture, communication model, and protocol.
3. Allocates the first available file descriptor of the process and associates a new file object with that file descriptor and with the socket object.

18.2.1.1 Socket initialization

Let's return to the service routine of the socket( ) system call. After having allocated a new BSD socket, the function must initialize it according to the given network architecture, communication model, and protocol.

For every known network architecture, the kernel stores a pointer to an object of type net_proto_family in the net_families array. Essentially, the object just defines the create method, which is invoked whenever the kernel initializes a new socket of that network architecture.

The create method corresponding to the PF_INET architecture is implemented by inet_create( ). This function checks whether the communication model and the protocol specified as parameters of the socket( ) system call are compatible with the IPS network architecture; then it allocates and initializes a new INET socket and links it to the parent BSD socket.
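The dispatch through an array of per-architecture descriptors, each providing a create method, is a common C pattern that can be modeled outside the kernel. In the following sketch every identifier beginning with toy_ or TOY_ is hypothetical; the structures are deliberately much smaller than the kernel's net_proto_family and socket objects, and the numeric constants are arbitrary values chosen for the example.

```c
#include <stddef.h>

enum { TOY_NPROTO = 32, TOY_PF_INET = 2 };
enum { TOY_STREAM = 1, TOY_DGRAM = 2 };
enum { TOY_TCP = 6, TOY_UDP = 17 };

struct toy_socket {
    int family;
    int type;
    int protocol;
};

/* Per-architecture descriptor: essentially just a create method. */
struct toy_proto_family {
    int family;
    int (*create)(struct toy_socket *sock, int protocol);
};

static const struct toy_proto_family *toy_families[TOY_NPROTO];

/* Register an architecture; returns -1 on a bad or already used slot. */
int toy_register(const struct toy_proto_family *pf)
{
    if (pf->family < 0 || pf->family >= TOY_NPROTO ||
        toy_families[pf->family] != NULL)
        return -1;
    toy_families[pf->family] = pf;
    return 0;
}

/* Initialize a socket by dispatching to the architecture's create method. */
int toy_sock_create(struct toy_socket *sock, int family, int type,
                    int protocol)
{
    if (family < 0 || family >= TOY_NPROTO || toy_families[family] == NULL)
        return -1;   /* unknown network architecture */
    sock->family = family;
    sock->type = type;
    return toy_families[family]->create(sock, protocol);
}

/* Example create method: accept only the stream and datagram models and
 * resolve protocol 0 to the default for the model. */
int toy_inet_create(struct toy_socket *sock, int protocol)
{
    if (sock->type != TOY_STREAM && sock->type != TOY_DGRAM)
        return -1;
    if (protocol == 0)
        protocol = (sock->type == TOY_STREAM) ? TOY_TCP : TOY_UDP;
    sock->protocol = protocol;
    return 0;
}

const struct toy_proto_family toy_inet_family = {
    TOY_PF_INET, toy_inet_create
};
```

After toy_register(&toy_inet_family), a call such as toy_sock_create(&s, TOY_PF_INET, TOY_DGRAM, 0) reaches toy_inet_create( ) through the table, in the same spirit as the kernel reaching inet_create( ) through net_families.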

18.2.1.2 Socket's files

Before terminating, the socket( ) service routine allocates a new file object and a new dentry object for the sockfs file of the socket; then it associates these objects with the process that raised the system call through a new file descriptor (see Section 12.2.6).

As far as the VFS is concerned, any file associated with a socket is in no way special. The corresponding dentry object and inode object are included in the dentry cache and in the inode cache, respectively. The process that created the socket can access the file by means of the system calls that act on already opened files, that is, the system calls that receive a file descriptor as a parameter. Of course, the file object methods are implemented by functions that operate on the socket rather than on the file.

As far as the User Mode process is concerned, however, the socket's file is somewhat peculiar. In fact, a process can never issue an open( ) system call on such a file because it never appears on the system directory tree (remember that the sockfs special filesystem has no visible mount point). For the same reason, it is not possible to remove a socket file through the unlink( ) system call: the inodes belonging to the sockfs filesystem are automatically destroyed by the kernel whenever the socket is closed (released).

18.2.2 The bind( ) System Call

Once the socket( ) system call completes, a new socket is created and initialized. It represents a new communication channel that can be identified by the following five elements: protocol, local IP address, local port number, remote IP address, and remote port number.

Only the "protocol" element has been set so far. Hence, the next action of the User Mode process consists of setting the "local IP address" and the "local port number." These two elements identify the process that is sending packets onto the socket so the receiving process on the remote machine can determine who is talking and where the answers should be sent. [3]

[3] Actually, when a process uses the UDP protocol, it can omit the invocation of the bind( ) system call. In this case, the kernel automatically assigns a local address and a local port number to the socket as soon as the program issues a connect( ) system call or sends its first datagram on the socket.

The corresponding instructions in our simple program are the following:

    struct sockaddr_in addr_local;
    addr_local.sin_family = AF_INET;
    addr_local.sin_port = htons(50000);
    addr_local.sin_addr.s_addr = htonl(0xc0a050f0); /* 192.160.80.240 */
    bind(sockfd, (struct sockaddr *) &addr_local,
         sizeof(struct sockaddr_in));

The addr_local local variable is of type struct sockaddr_in and represents an IPS identifier for a socket. It includes three significant fields:

sin_family
    The protocol family (AF_INET, AF_INET6, or AF_PACKET; this is the same as the macros in Table 18-1).

sin_port
    The port number.

sin_addr
    The network address. In the IPS architecture, it is composed of a single 32-bit field s_addr storing the IP address.

Therefore, our program sets the fields of the addr_local variable to the protocol family AF_INET, the port number 50,000, and the IP address 192.160.80.240. Notice how the dotted notation of the IP address is translated into a hexadecimal number.

In the 80x86 architecture, the numbers are represented in the "little endian" format (the byte at lower address is the less significant one) while the IPS architecture requires that the numbers be represented in the "big endian" format (the byte at lower address is the most significant one). Several functions, such as htons( ) and htonl( ), are used to ensure that data is sent in the network byte order; other functions, such as ntohs( ) and ntohl( ), ensure that received data is converted from the network to the host byte order.

The bind( ) system call receives as parameters the socket file descriptor and the address of addr_local. It also receives the length of the struct sockaddr_in data structure; in fact, bind( ) can be used for sockets of any network architecture, as well as for Unix sockets and any different type of socket that has addresses of different length.

The sys_bind( ) service routine copies the data of the addr_local variable into the kernel address space, retrieves the address of the BSD socket object (struct socket) that corresponds to the file descriptor, and invokes its bind method. In the IPS architecture, this method is implemented by the inet_bind( ) function.

The inet_bind( ) function performs essentially the following operations:

1. Invokes the inet_addr_type( ) function to check whether the IP address passed to the bind( ) system call corresponds to the address of some network card device of the host; if not, it returns an error code. However, the User Mode program may pass the special IP address INADDR_ANY (0.0.0.0), which essentially delegates to the kernel the task of assigning the IP sender address.

2. If the port number passed to the bind( ) system call is smaller than 1,024, checks whether the User Mode process has superuser privileges (this is the CAP_NET_BIND_SERVICE capability; see Section 20.1.1). However, the User Mode process may pass the value 0 as the port number; the kernel assigns a random, unused port number (see below).

3. Sets the rcv_saddr and saddr fields of the INET socket object with the IP address passed to the system call (the former field is used when looking in the routing table, while the latter is included in the header of outgoing packets). Usually, the fields hold the same value, except for special transmission modes like broadcasting and multicasting.

4. Invokes the get_port protocol method of the INET socket object to check whether there already exists an INET socket for the transport protocol using the same local port number and IP address as the one being initialized. For IPv4 sockets using the UDP transport protocol, the method is implemented by the udp_v4_get_port( ) function. To speed up the lookup operation, the function uses a per-protocol hash table. Moreover, if the User Mode program specified a value of 0 for the port, the function assigns an unused number to the socket.

5. Stores the local port number in the sport field of the INET socket object.
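The two special values mentioned in the steps above, INADDR_ANY for the address and 0 for the port, can be tried from User Mode. The following sketch binds a UDP socket with both and then uses the standard getsockname( ) call to read back the port the kernel picked, converting it with ntohs( ). The bind_any_port( ) name is an assumption made for this example.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Bind a new UDP socket to INADDR_ANY and port 0, then ask the kernel
 * which port it assigned. Returns the port in host byte order, or -1 on
 * failure. */
int bind_any_port(void)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    int port = -1;
    int fd = socket(PF_INET, SOCK_DGRAM, 0);

    if (fd < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(0);                  /* let the kernel choose */
    addr.sin_addr.s_addr = htonl(INADDR_ANY);  /* 0.0.0.0 */

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0 &&
        getsockname(fd, (struct sockaddr *)&addr, &len) == 0)
        port = ntohs(addr.sin_port);           /* network to host order */

    close(fd);
    return port;
}
```

The value returned differs from run to run, since the kernel is free to choose any unused port.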

18.2.3 The connect( ) System Call

The next operation of the User Mode process consists of setting the "remote IP address" and the "remote port number," so the kernel knows where datagrams written to the socket have to be sent. This is achieved by invoking the connect( ) system call.

It is important to observe that a User Mode program is in no way obliged to connect a UDP socket to a destination host. In fact, the program may use the sendto( ) and sendmsg( ) system calls to transmit datagrams over the socket, each time specifying the destination host's IP address and port number. Similarly, the program may receive datagrams from a UDP socket by invoking the recvfrom( ) and recvmsg( ) system calls. However, the connect( ) system call is required if the User Mode program transfers data on the socket by means of the read( ) and write( ) system calls.

Since our program is going to use the write( ) system call to send its datagram, it invokes connect( ) to set up the destination of the message. The relevant instructions are:

    struct sockaddr_in addr_remote;
    addr_remote.sin_family = AF_INET;
    addr_remote.sin_port = htons(49152);
    inet_pton(AF_INET, "192.160.80.110", &addr_remote.sin_addr);
    connect(sockfd, (struct sockaddr *) &addr_remote,
            sizeof(struct sockaddr_in));

The program initializes the addr_remote local variable by writing into it the IP address 192.160.80.110 and the port number 49,152. This is very similar to the initialization of the addr_local variable discussed in the previous section; however, this time the program invokes the inet_pton( ) library helper function to convert a string representing the IP address in dotted notation into a number in the network order format.

The connect( ) system call receives the same parameters as the bind( ) system call. It copies the data of the addr_remote variable into the kernel address space, retrieves the address of the BSD socket object (struct socket) corresponding to the file descriptor, and invokes its connect method. In the IPS architecture, this method is implemented by either the inet_dgram_connect( ) function for UDP or the inet_stream_connect( ) function for TCP.
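For comparison with the connect( ) plus write( ) approach, here is a minimal sketch of the alternative mentioned above, in which the destination is passed to sendto( ) on every call and the socket is never connected. The send_unconnected( ) function name and its parameter list are assumptions made for this illustration.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Send one datagram on an unconnected UDP socket, naming the destination
 * explicitly. Returns the number of bytes sent, or -1 on failure. */
ssize_t send_unconnected(int sockfd, const char *remote_ip,
                         unsigned short remote_port,
                         const void *buf, size_t len)
{
    struct sockaddr_in addr_remote;

    memset(&addr_remote, 0, sizeof(addr_remote));
    addr_remote.sin_family = AF_INET;
    addr_remote.sin_port = htons(remote_port);
    if (inet_pton(AF_INET, remote_ip, &addr_remote.sin_addr) != 1)
        return -1;   /* not a valid dotted IPv4 string */

    return sendto(sockfd, buf, len, 0,
                  (struct sockaddr *)&addr_remote, sizeof(addr_remote));
}
```

A program that talks to many different hosts through one socket may prefer this form, while a program that always talks to the same host can connect once and then use plain write( ) calls.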

Our simple program uses a UDP socket, so let's describe what the inet_dgram_connect( ) function does:

1. If the socket does not have a local port number, invokes inet_autobind( ) to automatically assign an unused value. In our case, the program issued a bind( ) system call before invoking connect( ), but an application using UDP is not really obliged to do so.

2. Invokes the connect method of the INET socket object.

The UDP protocol implements the INET socket's connect method by means of the udp_connect( ) function, which executes the following actions:

1. If the INET socket already has a destination host, removes it from the destination cache (which is the dst_cache field of the sock object; see the earlier Section 18.1.5).
2. Invokes the ip_route_connect( ) function to establish a route to the host identified by the IP address passed as a parameter of connect( ). In turn, this function invokes ip_route_output_key( ) to search for an entry corresponding to the route in the route cache (see the earlier Section 18.1.6.2). If the route cache does not include the desired entry, ip_route_output_key( ) invokes ip_route_output_slow( ) to look up a suitable entry in the FIB (see the earlier Section 18.1.6.1). Let's assume that, once this step terminates, a route is found, so the address of a suitable rtable object is determined.
3. Initializes the daddr field of the INET socket object with the remote IP address found in the rtable object. Usually, it coincides with the IP address specified by the user as a parameter of the connect( ) system call.
4. Initializes the dport field of the INET socket object with the remote port number specified as a parameter of the connect( ) system call.
5. Puts the value TCP_ESTABLISHED in the state field of the INET socket object (when used by UDP, the flag indicates that the INET socket is "connected" to a destination host).
6. Sets the dst_cache entry of the sock object to the address of the dst_entry object embedded in the rtable object (see the earlier Section 18.1.5).
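Seen from User Mode, the effect of these steps is that a UDP socket remembers its destination, so later write( ) calls need no address. The following is a minimal sketch, not taken from the book's example program: the helper names udp_connected_socket( ) and udp_peer( ) are hypothetical, and only standard BSD socket calls are used.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create a UDP socket and connect( ) it to ip:port.
 * For UDP, connect( ) sends nothing on the wire; the kernel only records
 * the destination. Returns the descriptor, or -1 on error. */
int udp_connected_socket(const char *ip, unsigned short port)
{
    struct sockaddr_in dst;
    int fd;

    assert(ip != NULL);
    fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;

    memset(&dst, 0, sizeof(dst));
    dst.sin_family = AF_INET;
    dst.sin_port = htons(port);
    if (inet_pton(AF_INET, ip, &dst.sin_addr) != 1 ||
        connect(fd, (struct sockaddr *)&dst, sizeof(dst)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Hypothetical helper: read back the remote address recorded by connect( ).
 * Returns 0 on success, -1 on error. */
int udp_peer(int fd, struct sockaddr_in *peer)
{
    socklen_t len = sizeof(*peer);

    assert(peer != NULL);
    return getpeername(fd, (struct sockaddr *)peer, &len);
}
```

After udp_connected_socket( ) succeeds, getpeername( ) reports the address and port that were passed to connect( ), which is the User Mode view of the remote address and port stored in the INET socket object.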

18.2.4 Writing Packets to a Socket

Finally, our example program is ready to send messages to the remote host; it simply writes the data onto the socket:

write(sockfd, mesg, strlen(mesg)+1);

The write( ) system call triggers the write method of the file object associated with the sockfd file descriptor. For socket files, this method is implemented by the sock_write( ) function, which performs the following actions:

1. Determines the address of the socket object embedded in the file's inode.
2. Allocates and initializes a "message header"; namely, a msghdr data structure, which stores various control information.
3. Invokes the sock_sendmsg( ) function, passing to it the addresses of the socket object and the msghdr data structure. In turn, this function performs the following actions:
   a. Invokes scm_send( ) to check the contents of the message header and allocate a scm_cookie (socket control message) data structure, storing into it a few fields of control information distilled from the message header.
   b. Invokes the sendmsg method of the socket object, passing to it the addresses of the socket object, message header, and scm_cookie data structure.
   c. Invokes scm_destroy( ) to release the scm_cookie data structure.

Since the BSD socket has been set up specifying the UDP protocol, the addresses of the socket object's methods are stored in the inet_dgram_ops table. In particular, the sendmsg method is implemented by the inet_sendmsg( ) function, which extracts the address of the INET socket stored in the BSD socket and invokes the sendmsg method of the INET socket.

Again, since the INET socket has been set up specifying the UDP protocol, the addresses of the sock object's methods are stored in the udp_prot table. In particular, the sendmsg method is implemented by the udp_sendmsg( ) function.
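The two levels of indirection just described (a method table for the BSD socket, another for the INET socket) can be modeled in User Mode with plain function pointers. This is only a sketch under stated assumptions: every type and function carrying the _model suffix is hypothetical, and the structures are far simpler than the kernel's socket, sock, and msghdr.

```c
#include <assert.h>
#include <stddef.h>

struct msg_model { const char *data; size_t len; };   /* stand-in for msghdr */
struct inet_sock_model;
struct bsd_sock_model;

/* Protocol-level method table (plays the role of udp_prot). */
struct proto_model {
    int (*sendmsg)(struct inet_sock_model *sk, const struct msg_model *m);
};

/* Socket-level method table (plays the role of inet_dgram_ops). */
struct sock_ops_model {
    int (*sendmsg)(struct bsd_sock_model *sock, const struct msg_model *m);
};

struct inet_sock_model {
    const struct proto_model *prot;
    size_t bytes_sent;              /* bookkeeping for the demonstration */
};

struct bsd_sock_model {
    const struct sock_ops_model *ops;
    struct inet_sock_model *sk;     /* INET socket stored in the BSD socket */
};

/* Second level: the protocol-specific routine. */
static int udp_sendmsg_model(struct inet_sock_model *sk,
                             const struct msg_model *m)
{
    sk->bytes_sent += m->len;
    return (int)m->len;
}

/* First level: extract the INET socket and call its sendmsg method. */
static int inet_sendmsg_model(struct bsd_sock_model *sock,
                              const struct msg_model *m)
{
    return sock->sk->prot->sendmsg(sock->sk, m);
}

static const struct proto_model udp_prot_model = { udp_sendmsg_model };
static const struct sock_ops_model inet_dgram_ops_model = { inet_sendmsg_model };

/* Set up both tables as if the socket had been created for UDP. */
void udp_socket_init_model(struct bsd_sock_model *sock,
                           struct inet_sock_model *sk)
{
    assert(sock != NULL && sk != NULL);
    sk->prot = &udp_prot_model;
    sk->bytes_sent = 0;
    sock->ops = &inet_dgram_ops_model;
    sock->sk = sk;
}

/* Entry point: dispatch through the BSD socket's method table. */
int sock_sendmsg_model(struct bsd_sock_model *sock, const struct msg_model *m)
{
    assert(sock != NULL && sock->ops != NULL && m != NULL);
    return sock->ops->sendmsg(sock, m);
}
```

Calling sock_sendmsg_model( ) on a socket prepared by udp_socket_init_model( ) reaches udp_sendmsg_model( ) through both tables. Choosing the tables at socket creation time is what lets the generic code stay protocol independent.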

18.2.4.1 Transport layer: the udp_sendmsg( ) function

The udp_sendmsg( ) function receives as parameters the addresses of the sock object and the message header (msghdr data structure), and performs the following actions:

1. Allocates a udpfakehdr data structure, which contains the UDP header of the packet to be sent.
2. Determines the address of the rtable describing the route to the destination host from the dst_cache field of the sock object.
3. Invokes ip_build_xmit( ), passing to it the addresses of all relevant data structures, like the sock object, the UDP header, the rtable object, and the address of a UDP-specific function that constructs the packet to be transmitted.

18.2.4.2 Transport and network layers: the ip_build_xmit( ) function

The ip_build_xmit( ) function is used to transmit an IP datagram. It performs the following actions:

1. Invokes sock_alloc_send_skb( ) to allocate a new socket buffer together with the corresponding socket buffer descriptor (see the earlier Section 18.1.7).
2. Determines the position inside the socket buffer where the payload shall go (the payload is placed near the end of the socket buffer, so its position depends on the payload size).
3. Writes the IP header on the socket buffer, leaving space for the UDP header.
4. Invokes either udp_getfrag_nosum( ) or udp_getfrag( ) to copy the data of the UDP datagram from the User Mode buffer; the latter function also computes, if required, the checksum of the data and of the UDP header (the UDP standard specifies that this checksum computation be optional). [4]
5. Invokes the output method of the dst_entry object, passing to it the address of the socket buffer descriptor.

[4] You might wonder why the IP header is written in the socket buffer before the UDP header. Well, the UDP standard dictates that the checksum, if used, has to be computed on the payload, the UDP header, and the last 12 bytes of the IP header (including the source and destination IP addresses). The simplest way to compute the UDP checksum is thus to write the IP header before the UDP header.
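To make the checksum mentioned in Step 4 and in the footnote concrete, here is a minimal User Mode sketch of the UDP checksum. It assumes the 12-byte pseudo-header layout of RFC 768 (source address, destination address, a zero byte, the protocol number 17, and the UDP length) rather than reusing bytes of a real IP header as the kernel does; the function names are hypothetical and this is not the kernel's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Add len bytes to a running sum of big-endian 16-bit words. */
static uint32_t sum16(const uint8_t *p, size_t len, uint32_t sum)
{
    while (len > 1) {
        sum += (uint32_t)((p[0] << 8) | p[1]);
        p += 2;
        len -= 2;
    }
    if (len)                        /* odd trailing byte, padded with zero */
        sum += (uint32_t)(p[0] << 8);
    return sum;
}

/* Fold the carries back in and return the one's complement of the sum. */
static uint16_t fold(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/* src and dst are IPv4 addresses as 4 bytes in network order; udp points
 * to the UDP header followed by the payload, udp_len is their total length.
 * The checksum field (bytes 6 and 7 of the UDP header) is summed as found,
 * so it must be zero when computing a checksum for transmission.
 * Note: RFC 768 asks the sender to transmit 0xFFFF when the result is 0;
 * that substitution is left to the caller. */
uint16_t udp_checksum(const uint8_t src[4], const uint8_t dst[4],
                      const uint8_t *udp, uint16_t udp_len)
{
    uint8_t pseudo[12];
    uint32_t sum;

    assert(src != NULL && dst != NULL && udp != NULL && udp_len >= 8);
    memcpy(pseudo, src, 4);
    memcpy(pseudo + 4, dst, 4);
    pseudo[8] = 0;
    pseudo[9] = 17;                 /* protocol number of UDP */
    pseudo[10] = (uint8_t)(udp_len >> 8);
    pseudo[11] = (uint8_t)(udp_len & 0xff);

    sum = sum16(pseudo, sizeof(pseudo), 0);
    sum = sum16(udp, udp_len, sum);
    return fold(sum);
}
```

A receiver can run the same function over a datagram whose checksum field is filled in: a result of zero means the pseudo-header, UDP header, and payload are consistent.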

18.2.4.3 Data link layer: composing the hardware header

The output method of the dst_entry object invokes the function of the data link layer that writes the hardware header (and trailer, if required) of the packet in the buffer.

The output method of IP's dst_entry object is usually implemented by the ip_output( ) function, which receives as a parameter the address skb of the socket buffer descriptor. In turn, this function essentially performs the following actions:

● Checks whether there is already a suitable hardware header descriptor in the cache by looking at the hh field of the skb->dst destination cache object (see the earlier Section 18.1.5). If the field is not NULL, the cache includes the header, so it copies the hardware header into the socket buffer, and then invokes the hh_output method of the hh_cache object.
● Otherwise, if the skb->dst->hh field is NULL, the header must be prepared from scratch. Thus, the function invokes the output method of the neighbour object pointed to by the neighbour field of skb->dst, which is implemented by the neigh_resolve_output( ) function. To compose the header, the latter function invokes a suitable method of the net_device object relative to the network card device that shall transmit the packet, and then inserts the new hardware header in the cache.

Both the hh_output method of the hh_cache object and the output method of the neighbour object end up invoking the dev_queue_xmit( ) function.
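The fast path and slow path for the hardware header amount to a small caching pattern. The sketch below models it in User Mode under loud assumptions: all names with the _model suffix are hypothetical, the header is a fixed 14-byte array (the Ethernet header length, chosen only as an example), and "building from scratch" is reduced to filling it with a constant.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define HH_LEN_MODEL 14     /* example size: length of an Ethernet header */

struct hh_cache_model {
    unsigned char data[HH_LEN_MODEL];
};

struct dst_model {
    struct hh_cache_model *hh;      /* NULL until a header has been cached */
    struct hh_cache_model storage;  /* backing store for the cached header */
    int builds;                     /* number of times the slow path ran */
};

/* Slow path (hypothetical): compose the header and insert it in the cache. */
static void build_header_model(struct dst_model *dst, unsigned char fill)
{
    memset(dst->storage.data, fill, HH_LEN_MODEL);
    dst->hh = &dst->storage;
    dst->builds++;
}

/* Write the hardware header at the start of frame, using the cached copy
 * when one exists and building it first when it does not. */
void output_model(struct dst_model *dst, unsigned char *frame)
{
    assert(dst != NULL && frame != NULL);
    if (dst->hh == NULL)
        build_header_model(dst, 0xAB);
    memcpy(frame, dst->hh->data, HH_LEN_MODEL);
}
```

Only the first packet sent to a given destination pays for composing the header; later packets just copy the cached bytes.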

18.2.4.4 Data link layer: enqueueing the socket buffer for transmission

The dev_queue_xmit( ) function takes care of queueing the socket buffer for later transmission. In general, network cards are slow devices, and at any given instant there can be many packets waiting to be transmitted. They are usually processed with a First-In, First-Out policy (hence the queue of packets), even if the Linux kernel offers several sophisticated packet scheduling algorithms to be used in high-performance routers. As a general rule, all network card devices define their own queue of packets waiting to be transmitted. Exceptions are virtual devices like the loopback device (lo) and the devices offered by various tunneling protocols, but we don't discuss these further.

A queue of socket buffers is implemented through a complex Qdisc object. Thanks to this data structure, the packet scheduling functions can efficiently manipulate the queue and quickly select the "best" packet to be sent. However, for the purpose of our simple description, the queue is just a list of socket buffer descriptors.

Essentially, dev_queue_xmit( ) performs the following actions:

1. Checks whether the driver of the network device (whose descriptor is stored in the dev field of the socket buffer descriptor) defines its own queue of packets waiting to be transmitted (the address of the Qdisc object is stored in the qdisc field of the net_device object).
2. Invokes the enqueue method of the corresponding Qdisc object to append the socket buffer to the queue.
3. Invokes the qdisc_run( ) function to ensure that the network device is actively sending the packets in the queue.
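For the simplified view taken here (the queue as a plain list of socket buffer descriptors reached through enqueue and dequeue methods), a User Mode model might look as follows. The _model types are hypothetical and much simpler than the kernel's Qdisc and sk_buff; the sketch implements only the First-In, First-Out policy.

```c
#include <assert.h>
#include <stddef.h>

struct skb_model {
    int id;                         /* identifies the packet in this demo */
    struct skb_model *next;
};

struct qdisc_model {
    int (*enqueue)(struct qdisc_model *q, struct skb_model *skb);
    struct skb_model *(*dequeue)(struct qdisc_model *q);
    struct skb_model *head, *tail;
    int len;
};

/* Append a descriptor at the tail; returns the new queue length. */
static int fifo_enqueue_model(struct qdisc_model *q, struct skb_model *skb)
{
    skb->next = NULL;
    if (q->tail != NULL)
        q->tail->next = skb;
    else
        q->head = skb;
    q->tail = skb;
    return ++q->len;
}

/* Remove and return the descriptor at the head, or NULL if empty. */
static struct skb_model *fifo_dequeue_model(struct qdisc_model *q)
{
    struct skb_model *skb = q->head;

    if (skb == NULL)
        return NULL;
    q->head = skb->next;
    if (q->head == NULL)
        q->tail = NULL;
    q->len--;
    return skb;
}

/* Install the FIFO methods on an empty queue. */
void fifo_init_model(struct qdisc_model *q)
{
    assert(q != NULL);
    q->enqueue = fifo_enqueue_model;
    q->dequeue = fifo_dequeue_model;
    q->head = q->tail = NULL;
    q->len = 0;
}
```

Callers touch the queue only through q->enqueue and q->dequeue, so a different scheduling policy could be swapped in by installing other functions in fifo_init_model( )'s place.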

The chain of functions executed by the sys_write( ) system call service routine ends here. As you see, the final result consists of a new packet that is appended to the transmit queue of a network card device. In the next section, we look at how our packet is processed by the network card.


18.3 Sending Packets to the Network Card

A network card device driver is usually started either when the kernel inserts a packet in its transmit queue (as described in the previous section), or when a packet is received from the communication channel. Let's focus here on packet transmission.

As we have seen, the qdisc_run( ) function is invoked whenever the kernel wishes to activate a network card device driver; it is also executed by the NET_TX_SOFTIRQ softirq, which is implemented by the net_tx_action( ) function (see Section 4.7).

Essentially, the qdisc_run( ) function checks whether the network card device is idle and can thus transmit the packets in the queue. If the device cannot do this (for instance, because the card is already busy transmitting or receiving a packet, the queue has been stopped to avoid flooding the communication channel, or for whatever other reason), the NET_TX_SOFTIRQ softirq is activated and the current execution of qdisc_run( ) is terminated. At a later time, when the scheduler selects a ksoftirqd_CPUn kernel thread, the net_tx_action( ) function invokes qdisc_run( ) again to retry the packet transmission.

In particular, qdisc_run( ) performs the following actions:

1. Checks whether the packet queue is "stopped" (that is, whether a suitable bit in the state field of the net_device network card object is set). If it is stopped, the function returns immediately.
2. Invokes the qdisc_restart( ) function, which in turn performs the following actions:
   a. Invokes the dequeue method of the Qdisc packet queue to extract a packet from the queue. If the queue is empty, it terminates.
   b. Checks whether a packet sniffing policy is enforced on the kernel, telling it to pass a copy of each outgoing packet to a local socket; in this case, the function invokes the dev_queue_xmit_nit( ) function to do the job. We won't discuss this further.
   c. Invokes the hard_start_xmit method of the net_device object that describes the network card device.
   d. If the hard_start_xmit method fails in transmitting the packet, it reinserts the packet in the queue and invokes cpu_raise_softirq( ) to schedule the activation of the NET_TX_SOFTIRQ softirq.
3. If the queue is now empty, or the hard_start_xmit method fails in transmitting the packet, the function terminates. Otherwise, it jumps to Step 1 to process another packet in the queue.

The hard_start_xmit method is specific to the network card device and takes care of transferring the packet from the socket buffer to the device's memory. Typically, the method limits itself to activating a DMA transfer. In PCI-based network cards, moreover, a small number of DMA transfers may usually be booked in advance: they are automatically activated by the card whenever it finishes the ongoing DMA transfers.

If the card is not able to accept further packets because the device's memory is full, the method stops the packet queue by setting the proper bit in the state field of the net_device object. Therefore, the qdisc_run( ) function terminates and is presumably executed again later by the softirq.

When a DMA transfer ends, the card raises an interrupt. The corresponding interrupt handler, in turn, performs the following actions:

1. Acknowledges the interrupt issued by the card.
2. Checks for transmission errors, updates driver statistics, and so on.
3. Invokes, if necessary, the cpu_raise_softirq( ) function to schedule the activation of the softirq.
4. If the queue is stopped, resets the bit in the state field of the net_device object and restarts packet processing.

As you see, network card device drivers work like disk device drivers: the real work is mostly done in interrupt handlers and deferrable functions, so that ordinary processes are not blocked waiting for packet transmissions.


18.4 Receiving Packets from the Network Card

This chapter is mostly focused on how the kernel handles the transmission of network packets. We have already glimpsed many crucial data structures of the networking code, so we will just give a brief description of the other side of the story; namely, how a network packet is received.

The main difference between transmitting and receiving is that the kernel cannot predict when a packet will arrive at a network card device. Therefore, the networking code that takes care of receiving the packets runs in interrupt handlers and deferrable functions.

Let's sketch a typical chain of events occurring when a packet carrying the right hardware address (card identifier) arrives at the network device.

1. The network device saves the packet in a buffer in the device's memory (the card usually keeps several packets at once in a circular buffer).
2. The network device raises an interrupt.
3. The interrupt handler allocates and initializes a new socket buffer for the packet.
4. The interrupt handler copies the packet from the device's memory to the socket buffer.
5. The interrupt handler invokes a function (such as the eth_type_trans( ) function for Ethernet and IEEE 802.3) to determine the protocol of the packet encapsulated in the data link frame.
6. The interrupt handler invokes the netif_rx( ) function to notify the Linux networking code that a new packet has arrived and should be processed.

Of course, the interrupt handler is specific to the network card device. Many device drivers try to be nice to the other devices in the system and move lengthy tasks, such as allocating a socket buffer or copying a packet, to deferrable functions.

The netif_rx( ) function is the main entry point of the receiving code of the networking layer (above the network card device driver). The kernel uses a per-CPU queue for the packets that have been received from the network devices and are waiting to be processed by the various protocol stack layers. The function essentially appends the new packet to this queue and invokes cpu_raise_softirq( ) to schedule the activation of the NET_RX_SOFTIRQ softirq. (Remember that the same softirq can be executed concurrently on several CPUs, hence the reason for the per-CPU queue of received packets.)

The NET_RX_SOFTIRQ softirq is implemented by the net_rx_action( ) function, which essentially executes the following operations: [5]

1. Extracts the first packet from the queue. If the queue is empty, it terminates.
2. Determines the network layer protocol number encoded in the data link layer.
3. Invokes a suitable function of the network layer protocol.

[5] We omit discussing several special cases, such as when the packet has to be quickly forwarded to another network card device or when the host is acting as a bridge that links two local area networks as if they were a single one.

The corresponding function for the IP protocol is named ip_rcv( ), which essentially executes the following actions:

1. Checks the length and the checksum of the packet and discards it if it is corrupted or truncated.
2. Invokes ip_route_input( ), which initializes the destination cache (dst_entry field) of the socket buffer descriptor. To determine the route followed by the packet, the function looks the route up first in the route cache, and then in the FIB (if the route cache doesn't include a relevant entry). In this way, the kernel determines whether the packet must be forwarded to another host or simply passed to a protocol of the transport layer.
3. Checks to see whether any packet sniffing or other input policy is enforced. In the affirmative case, it handles the packet accordingly; we don't discuss these topics further.
4. Invokes the input method of the dst_entry object of the packet.

If the packet has to be forwarded to another host, the input method is implemented by the ip_forward( ) function; otherwise, it is implemented by the ip_local_deliver( ) function. Let's follow the latter path.

The ip_local_deliver( ) function takes care of reassembling the original IP datagram, if the datagram has been fragmented along its way. Then the function reads the IP header and determines the type of transport protocol to which the packet belongs. If the transport protocol is TCP, the function ends up invoking tcp_v4_rcv( ); if the transport protocol is UDP, the function ends up invoking udp_rcv( ).
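The last decision, choosing a transport protocol from the IP header, can be illustrated with a few lines of User Mode C. This is a sketch under stated assumptions: the function and enumeration names are hypothetical, fragments are not reassembled, and any protocol other than TCP or UDP is simply reported as dropped, which is a simplification. The only facts relied upon are the IPv4 header layout (version and header length in byte 0, protocol number in byte 9) and the IPPROTO_TCP and IPPROTO_UDP constants from <netinet/in.h>.

```c
#include <assert.h>
#include <netinet/in.h>     /* IPPROTO_TCP, IPPROTO_UDP */
#include <stddef.h>
#include <stdint.h>

enum handler_model { H_DROP, H_TCP, H_UDP };

/* ipv4 points to the first byte of an IPv4 header; len is the number of
 * bytes available. Returns which transport-layer routine would be chosen. */
enum handler_model transport_handler_model(const uint8_t *ipv4, size_t len)
{
    size_t ihl;

    assert(ipv4 != NULL);
    if (len < 20 || (ipv4[0] >> 4) != 4)    /* too short, or not IPv4 */
        return H_DROP;
    ihl = (size_t)(ipv4[0] & 0x0f) * 4;     /* header length in bytes */
    if (ihl < 20 || ihl > len)
        return H_DROP;

    switch (ipv4[9]) {                      /* protocol field */
    case IPPROTO_TCP:
        return H_TCP;                       /* would end up in tcp_v4_rcv( ) */
    case IPPROTO_UDP:
        return H_UDP;                       /* would end up in udp_rcv( ) */
    default:
        return H_DROP;                      /* simplification of this sketch */
    }
}
```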

Let's continue following the UDP path. The udp_rcv( ) function essentially executes the following actions:

1. Invokes the udp_v4_lookup( ) function to find the INET socket to which the UDP datagram has been sent (by looking at the port number inside the UDP header). The kernel keeps the INET sockets in a hash table so that the lookup operation is reasonably fast. If the UDP datagram is not associated with a socket, the function discards the packet and terminates.
2. Invokes udp_queue_rcv_skb( ), which in turn invokes sock_queue_rcv_skb( ), to append the packet to a queue of the INET socket (receive_queue field of the sock object) and to invoke the data_ready method of the sock object.
3. Releases the socket buffer and the socket buffer descriptor.

INET sockets implement the data_ready method by means of the sock_def_readable( ) function, which essentially wakes up any process sleeping in the socket's wait queue (listed in the sleep field of the sock object).

There is one final step to describe: what happens when a process reads from the BSD socket owning our INET socket. The read( ) system call triggers the read method of the file object associated with the socket's special file. This method is implemented by the sock_read( ) function, which in turn invokes the sock_recvmsg( ) function. The latter function is similar to sock_sendmsg( ) described earlier. Essentially, it invokes the recvmsg method of the BSD socket. In turn, this method (inet_recvmsg( )) invokes the recvmsg method of the INET socket; that is, either the tcp_recvmsg( ) or the udp_recvmsg( ) function.

Finally, the udp_recvmsg( ) function executes the following actions:

1. Invokes the skb_recv_datagram( ) function to extract the first packet from the receive_queue queue of the INET socket and return the address of the corresponding socket buffer descriptor. If the queue is empty, the function blocks the current process (unless the read operation is nonblocking).
2. If the UDP datagram carries a checksum, checks that the message has not been corrupted during the transmission (actually, this step is performed at the same time as Step 3).
3. Copies the payload of the UDP datagram into the User Mode buffer.
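The blocking behavior described in Step 1 is visible from User Mode. The sketch below is an assumption-laden illustration, not part of the book's example program: the helper names are hypothetical, the socket is bound to the loopback address on a port chosen by the kernel, and the standard MSG_DONTWAIT flag of recv( ) is used to request a nonblocking read.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: bind a UDP socket to 127.0.0.1 on a port chosen by
 * the kernel and store that port (host byte order) in *port.
 * Returns the descriptor, or -1 on error. */
int udp_bind_loopback(unsigned short *port)
{
    struct sockaddr_in a;
    socklen_t len = sizeof(a);
    int fd;

    assert(port != NULL);
    fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;
    memset(&a, 0, sizeof(a));
    a.sin_family = AF_INET;
    a.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    a.sin_port = 0;                 /* let the kernel pick an unused port */
    if (bind(fd, (struct sockaddr *)&a, sizeof(a)) < 0 ||
        getsockname(fd, (struct sockaddr *)&a, &len) < 0) {
        close(fd);
        return -1;
    }
    *port = ntohs(a.sin_port);
    return fd;
}

/* Read one datagram. With nonblocking set, an empty receive queue makes
 * the call fail at once instead of putting the process to sleep. */
ssize_t udp_read(int fd, void *buf, size_t size, int nonblocking)
{
    assert(buf != NULL);
    return recv(fd, buf, size, nonblocking ? MSG_DONTWAIT : 0);
}

/* True if the last failed read found the receive queue empty. */
int would_block(void)
{
    return errno == EAGAIN || errno == EWOULDBLOCK;
}
```

A nonblocking udp_read( ) on a socket with no queued datagram returns -1 and would_block( ) is true; once a datagram has been sent to the socket, a blocking udp_read( ) returns its payload.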


Chapter 19. Process Communication

This chapter explains how User Mode processes can synchronize their actions and exchange data. We already covered several synchronization topics in Chapter 5, but the actors there were kernel control paths, not User Mode programs. We are now ready, after having discussed I/O management and filesystems at length, to extend the discussion to User Mode processes. These processes must rely on the kernel to facilitate interprocess synchronization and communication.

As we saw in Section 12.7.1, a form of synchronization among User Mode processes can be achieved by creating a (possibly empty) file and using suitable VFS system calls to lock and unlock it. While processes can similarly share data via temporary files protected by locks, this approach is costly because it requires accesses to the disk filesystem. For this reason, all Unix kernels include a set of system calls that supports process communication without interacting with the filesystem; furthermore, several wrapper functions were developed and inserted in suitable libraries to expedite how processes issue their synchronization requests to the kernel.

As usual, application programmers have a variety of needs that call for different communication mechanisms. Here are the basic mechanisms that Unix systems offer to allow interprocess communication:

Pipes and FIFOs (named pipes)
Best suited to implement producer/consumer interactions among processes. Some processes fill the pipe with data, while others extract data from the pipe.

Semaphores
Represent, as the name implies, the User Mode version of the kernel semaphores discussed in Section 5.3.6.

Messages
Allow processes to exchange messages (short blocks of data) by reading and writing them in predefined message queues.

Shared memory regions
Allow processes to exchange information via a shared block of memory. In applications that must share large amounts of data, this can be the most efficient form of process communication.

Sockets
Allow processes on different computers to exchange data through a network, as described in Chapter 18. Sockets can also be used as a communication tool for processes located on the same host computer; the X Window System graphic interface, for instance, uses a socket to allow client programs to exchange data with the X server.


19.1 Pipes

Pipes are an interprocess communication mechanism that is provided in all flavors of Unix. A pipe is a one-way flow of data between processes: all data written by a process to the pipe is routed by the kernel to another process, which can thus read it.

In Unix command shells, pipes can be created by means of the | operator. For instance, the following statement instructs the shell to create two processes connected by a pipe:

$ ls | more

The standard output of the first process, which executes the ls program, is redirected to the pipe; the second process, which executes the more program, reads its input from the pipe.

Note that the same results can also be obtained by issuing two commands such as the following:

$ ls > temp
$ more < temp

The first command redirects the output of ls into a regular file; then the second command forces more to read its input from the same file. Of course, using pipes instead of temporary files is usually more convenient due to the following reasons:

● The shell statement is much shorter and simpler.
● There is no need to create temporary regular files, which must be deleted later.

19.1.1 Using a Pipe

Pipes may be considered open files that have no corresponding image in the mounted filesystems. A process creates a new pipe by means of the pipe( ) system call, which returns a pair of file descriptors; the process may then pass these descriptors to its descendants through fork( ), thus sharing the pipe with them. The processes can read from the pipe by using the read( ) system call with the first file descriptor; likewise, they can write into the pipe by using the write( ) system call with the second file descriptor.

POSIX defines only half-duplex pipes, so even though the pipe( ) system call returns two file descriptors, each process must close one before using the other. If a two-way flow of data is required, the processes must use two different pipes by invoking pipe( ) twice.
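As a minimal sketch of this usage, the function below creates a pipe, forks a child that writes a short message into it, and reads the message back in the parent. The function name pipe_roundtrip( ) is hypothetical; the system calls are the standard ones named above. Each process closes the descriptor it does not use, as POSIX requires, even though Linux would not insist on it.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: send msg from a child process to the parent through
 * a pipe. Stores at most size-1 bytes plus a terminating NUL in buf and
 * returns the number of bytes read, or -1 on error. */
ssize_t pipe_roundtrip(const char *msg, char *buf, size_t size)
{
    int fds[2];     /* fds[0]: read channel, fds[1]: write channel */
    pid_t pid;
    ssize_t n;

    assert(msg != NULL && buf != NULL && size > 0);
    if (pipe(fds) < 0)
        return -1;

    pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {                     /* child: the producer */
        ssize_t w;

        close(fds[0]);                  /* the child never reads */
        w = write(fds[1], msg, strlen(msg));
        _exit(w < 0 ? 1 : 0);
    }

    close(fds[1]);                      /* parent: the consumer never writes */
    n = read(fds[0], buf, size - 1);    /* sleeps until data or EOF arrives */
    if (n >= 0)
        buf[n] = '\0';
    close(fds[0]);
    waitpid(pid, NULL, 0);              /* reap the child */
    return n;
}
```

The single read( ) is enough here only because the message is short and written with one write( ) call; a general consumer would loop until read( ) returns 0, which happens once every write descriptor of the pipe has been closed.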

S e ve ra l Un ix s ys t e m s , s u ch a s S ys t e m V Re le a s e 4 , im p le m e n t fu ll- d u p le x p ip e s . In a fu lld u p le x p ip e , b o t h d e s crip t o rs ca n b e writ t e n in t o a n d re a d fro m , t h u s t h e re a re t wo b id ire ct io n a l ch a n n e ls o f in fo rm a t io n . Lin u x a d o p t s ye t a n o t h e r a p p ro a ch : e a ch p ip e 's file d e s crip t o rs a re s t ill o n e - wa y, b u t it is n o t n e ce s s a ry t o clo s e o n e o f t h e m b e fo re u s in g t h e o t h e r. Le t 's re s u m e t h e p re vio u s e xa m p le . Wh e n t h e co m m a n d s h e ll in t e rp re t s t h e ls|more s t a t e m e n t , it e s s e n t ia lly p e rfo rm s t h e fo llo win g a ct io n s :

1. Invokes the pipe( ) system call; let's assume that pipe( ) returns the file descriptors 3 (the pipe's read channel) and 4 (the write channel).

2. Invokes the fork( ) system call twice.

3. Invokes the close( ) system call twice to release file descriptors 3 and 4.

The first child process, which must execute the ls program, performs the following operations:

1. Invokes dup2(4,1) to copy file descriptor 4 to file descriptor 1. From now on, file descriptor 1 refers to the pipe's write channel.

2. Invokes the close( ) system call twice to release file descriptors 3 and 4.

3. Invokes the execve( ) system call to execute the ls program (see Section 20.4). The program writes its output to the file that has file descriptor 1 (the standard output); i.e., it writes into the pipe.

The second child process must execute the more program; therefore, it performs the following operations:

1. Invokes dup2(3,0) to copy file descriptor 3 to file descriptor 0. From now on, file descriptor 0 refers to the pipe's read channel.

2. Invokes the close( ) system call twice to release file descriptors 3 and 4.

3. Invokes the execve( ) system call to execute more. By default, that program reads its input from the file that has file descriptor 0 (the standard input); i.e., it reads from the pipe.

In this simple example, the pipe is used by exactly two processes. Because of its implementation, though, a pipe can be used by an arbitrary number of processes. [1] Clearly, if two or more processes read or write the same pipe, they must explicitly synchronize their accesses by using file locking (see Section 12.7.1) or IPC semaphores (see Section 19.3.3 later in this chapter).

[1] Since most shells offer pipes that connect only two processes, applications requiring pipes used by more than two processes must be coded in a programming language such as C.
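The shell's actions for a two-command pipeline can be sketched in C roughly as follows. This is only an illustration under a few assumptions: run_pipeline( ) is a hypothetical name, execvp( ) is used instead of execve( ) so that the programs are looked up in PATH, and the descriptor numbers are whatever pipe( ) happens to return rather than 3 and 4.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: runs `producer | consumer` and returns the
 * consumer's exit status, or -1 on failure. */
int run_pipeline(char *const producer[], char *const consumer[])
{
    int fd[2];                      /* fd[0]: read channel, fd[1]: write channel */
    int status = 0;

    if (pipe(fd) == -1)
        return -1;

    pid_t first = fork();
    if (first == -1) {
        close(fd[0]);
        close(fd[1]);
        return -1;
    }
    if (first == 0) {
        dup2(fd[1], STDOUT_FILENO); /* standard output now refers to the pipe */
        close(fd[0]);
        close(fd[1]);
        execvp(producer[0], producer);
        _exit(127);                 /* reached only if execvp( ) failed */
    }

    pid_t second = fork();
    if (second == -1) {
        close(fd[0]);
        close(fd[1]);
        waitpid(first, NULL, 0);
        return -1;
    }
    if (second == 0) {
        dup2(fd[0], STDIN_FILENO);  /* standard input now refers to the pipe */
        close(fd[0]);
        close(fd[1]);
        execvp(consumer[0], consumer);
        _exit(127);
    }

    /* The parent releases both descriptors, as the shell does. */
    close(fd[0]);
    close(fd[1]);

    waitpid(first, NULL, 0);
    if (waitpid(second, &status, 0) == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

If the parent kept its copy of the write channel open, the consumer would never see end-of-file, which is why the final pair of close( ) calls matters.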

Many Unix systems provide, besides the pipe( ) system call, two wrapper functions named popen( ) and pclose( ) that handle all the dirty work usually done when using pipes. Once a pipe has been created by means of the popen( ) function, it can be used with the high-level I/O functions included in the C library (fprintf( ), fscanf( ), and so on).

In Linux, popen( ) and pclose( ) are included in the C library. The popen( ) function receives two parameters: the filename pathname of an executable file and a type string specifying the direction of the data transfer. It returns the pointer to a FILE data structure. The popen( ) function essentially performs the following operations:

1. Creates a new pipe by using the pipe( ) system call

2. Forks a new process, which in turn executes the following operations:

   a. If type is r, duplicates the file descriptor associated with the pipe's write channel as file descriptor 1 (standard output); otherwise, if type is w, duplicates the file descriptor associated with the pipe's read channel as file descriptor 0 (standard input)

   b. Closes the file descriptors returned by pipe( )

   c. Invokes the execve( ) system call to execute the program specified by filename

3. If type is r, closes the file descriptor associated with the pipe's write channel; otherwise, if type is w, closes the file descriptor associated with the pipe's read channel

4. Returns the address of the FILE file pointer that refers to whichever file descriptor for the pipe is still open

After the popen( ) invocation, parent and child can exchange information through the pipe: the parent can read (if type is r) or write (if type is w) data by using the FILE pointer returned by the function. The data is written to the standard output or read from the standard input, respectively, by the program executed by the child process.

The pclose( ) function (which receives the file pointer returned by popen( ) as its parameter) simply invokes the wait4( ) system call and waits for the termination of the process created by popen( ).

19.1.2 Pipe Data Structures

We now have to start thinking again on the system call level. Once a pipe is created, a process uses the read( ) and write( ) VFS system calls to access it. Therefore, for each pipe, the kernel creates an inode object plus two file objects: one for reading and the other for writing. When a process wants to read from or write to the pipe, it must use the proper file descriptor.

When the inode object refers to a pipe, its i_pipe field points to a pipe_inode_info structure shown in Table 19-1.

Table 19-1. The pipe_inode_info structure

| Type | Field | Description |
|---|---|---|
| struct wait_queue * | wait | Pipe/FIFO wait queue |
| char * | base | Address of kernel buffer |
| unsigned int | len | Number of bytes written into the buffer and yet to be read |
| unsigned int | start | Read position in kernel buffer |
| unsigned int | readers | Flag for (or number of) reading processes |
| unsigned int | writers | Flag for (or number of) writing processes |
| unsigned int | waiting_readers | Number of reading processes sleeping in the wait queue |
| unsigned int | waiting_writers | Number of writing processes sleeping in the wait queue |
| unsigned int | r_counter | Like readers, but used when waiting for a process that reads from the FIFO |
| unsigned int | w_counter | Like writers, but used when waiting for a process that writes into the FIFO |

Besides one inode and two file objects, each pipe has its own pipe buffer: a single page frame containing the data written into the pipe and yet to be read. The address of this page frame is stored in the base field of the pipe_inode_info structure. The len field of the structure stores the number of bytes written into the pipe buffer that are yet to be read; in the following, we call that number the current pipe size.

The pipe buffer is circular and it is accessed both by reading and writing processes, so the kernel must keep track of two current positions in the buffer:

● The offset of the next byte to be read, which is stored in the start field of the pipe_inode_info structure

● The offset of the next byte to be written, which is derived from start and the pipe size (the len field of the structure)

To avoid race conditions on the pipe's data structures, the kernel prevents concurrent accesses to the pipe buffer through the use of the i_sem semaphore included in the inode object.
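The bookkeeping of the circular buffer can be modeled in user space as shown below. This is a simplified sketch, not kernel code: struct pipe_model and the two functions are hypothetical, the buffer is embedded in the structure rather than pointed to by base, bytes are copied one at a time for clarity, and no locking is shown (in the kernel, the i_sem semaphore serializes access).

```c
#include <stddef.h>

#define PIPE_MODEL_SIZE 4096        /* one 4 KB page frame, as in the text */

/* Hypothetical user-space model of the fields that describe the buffer. */
struct pipe_model {
    char base[PIPE_MODEL_SIZE];     /* the pipe buffer */
    unsigned int len;               /* bytes written and not yet read */
    unsigned int start;             /* offset of the next byte to be read */
};

/* Copies up to n bytes into the buffer; returns the number stored. */
size_t pipe_model_write(struct pipe_model *p, const char *src, size_t n)
{
    size_t done = 0;

    while (done < n && p->len < PIPE_MODEL_SIZE) {
        /* The write offset is derived from start and len. */
        unsigned int end = (p->start + p->len) % PIPE_MODEL_SIZE;
        p->base[end] = src[done++];
        p->len++;
    }
    return done;
}

/* Copies up to n bytes out of the buffer; returns the number fetched. */
size_t pipe_model_read(struct pipe_model *p, char *dst, size_t n)
{
    size_t done = 0;

    while (done < n && p->len > 0) {
        dst[done++] = p->base[p->start];
        p->start = (p->start + 1) % PIPE_MODEL_SIZE;
        p->len--;
    }
    return done;
}
```

Because the write offset is always recomputed as (start + len) modulo the buffer size, only two fields need to be stored, which matches the description above.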

19.1.2.1 The pipefs special filesystem

A pipe is implemented as a set of VFS objects, which have no corresponding disk image. In Linux 2.4, these VFS objects are organized into the pipefs special filesystem to expedite their handling (see Section 12.3.1). Since this filesystem has no mount point in the system directory tree, users never see it. However, thanks to pipefs, the pipes are fully integrated in the VFS layer, and the kernel can handle them in the same way as named pipes or FIFOs, which truly exist as files recognizable to end users (see Section 19.2 later in this chapter).

The init_pipe_fs( ) function, typically executed during kernel initialization, registers the pipefs filesystem and mounts it (refer to the discussion in Section 12.4.1):

    struct file_system_type pipe_fs_type;
    /* describe the filesystem type */
    pipe_fs_type.name = "pipefs";
    pipe_fs_type.read_super = pipefs_read_super;
    pipe_fs_type.fs_flags = FS_NOMOUNT;
    /* register it, then mount it inside the kernel */
    register_filesystem(&pipe_fs_type);
    pipe_mnt = do_kern_mount("pipefs", 0, "pipefs", NULL);

The mounted filesystem object that represents the root directory of pipefs is stored in the pipe_mnt variable.

19.1.3 Creating and Destroying a Pipe

The pipe( ) system call is serviced by the sys_pipe( ) function, which in turn invokes the do_pipe( ) function. To create a new pipe, do_pipe( ) performs the following operations:

1. Invokes the get_pipe_inode( ) function, which allocates and initializes an inode object for the pipe in the pipefs filesystem. In particular, this function executes the following actions:

   a. Allocates a pipe_inode_info data structure and stores its address in the i_pipe field of the inode.

   b. Allocates a page frame for the pipe buffer and stores its starting address in the base field of the pipe_inode_info structure.

   c. Initializes the start, len, waiting_readers, and waiting_writers fields of the pipe_inode_info structure to 0.

   d. Initializes the r_counter and w_counter fields of the pipe_inode_info structure to 1.

2. Sets the readers and writers fields of the pipe_inode_info structure to 1.

3. Allocates a file object and a file descriptor for the read channel of the pipe, sets the f_flags field of the file object to O_RDONLY, and initializes the f_op field with the address of the read_pipe_fops table.

4. Allocates a file object and a file descriptor for the write channel of the pipe, sets the f_flags field of the file object to O_WRONLY, and initializes the f_op field with the address of the write_pipe_fops table.

5. Allocates a dentry object and uses it to link the two file objects and the inode object (see Section 12.1.1); then inserts the new inode in the pipefs special filesystem.

6. Returns the two file descriptors to the User Mode process.

The process that issues a pipe( ) system call is initially the only process that can access the new pipe, both for reading and writing. To represent that the pipe has both a reader and a writer, the readers and writers fields of the pipe_inode_info data structure are initialized to 1. In general, each of these two fields is set to 1 only if the corresponding pipe's file object is still opened by a process; the field is set to 0 if the corresponding file object has been released, since it is no longer accessed by any process.

Forking a new process does not increase the value of the readers and writers fields, so they never rise above 1; [2] however, it does increase the value of the usage counters of all file objects still used by the parent process (see Section 3.4.1). Thus, the objects are not released even when the parent dies, and the pipe stays open for use by the children.

[2] As we'll see, the readers and writers fields act as counters instead of flags when associated with FIFOs.

Whenever a process invokes the close( ) system call on a file descriptor associated with a pipe, the kernel executes the fput( ) function on the corresponding file object, which decrements the usage counter. If the counter becomes 0, the function invokes the release method of the file operations (see Section 12.6.3 and Section 12.2.6).

According to whether the file is associated with the read or write channel, the release method is implemented by either pipe_read_release( ) or pipe_write_release( ); both functions invoke pipe_release( ), which sets either the readers field or the writers field of the pipe_inode_info structure to 0. The function checks whether both the readers and writers fields are equal to 0; in this case, it releases the page frame containing the pipe buffer. Otherwise, the function wakes up any processes sleeping in the pipe's wait queue so they can recognize the change in the pipe state.
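From User Mode, the "one inode, two file objects" arrangement can be observed indirectly. The sketch below, with the hypothetical name pipe_ends_share_inode( ), checks that both descriptors returned by pipe( ) report the same inode through fstat( ) while carrying different access modes. It assumes a Linux system, where fstat( ) on a pipe descriptor reports the inode number of the underlying pipe.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical check: returns 1 if both pipe ends are FIFOs backed by the
 * same inode, with the first end read-only and the second write-only;
 * 0 if not; -1 on error. */
int pipe_ends_share_inode(void)
{
    int fd[2];
    struct stat rd, wr;
    int ok = -1;

    if (pipe(fd) == -1)
        return -1;

    if (fstat(fd[0], &rd) == 0 && fstat(fd[1], &wr) == 0) {
        /* The access mode lives in the file object, one per descriptor. */
        int rd_mode = fcntl(fd[0], F_GETFL) & O_ACCMODE;
        int wr_mode = fcntl(fd[1], F_GETFL) & O_ACCMODE;

        ok = S_ISFIFO(rd.st_mode) && S_ISFIFO(wr.st_mode) &&
             rd.st_dev == wr.st_dev && rd.st_ino == wr.st_ino &&
             rd_mode == O_RDONLY && wr_mode == O_WRONLY;
    }
    close(fd[0]);
    close(fd[1]);
    return ok;
}
```

The two close( ) calls at the end release both file objects, after which the kernel can free the pipe buffer.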

19.1.4 Reading from a Pipe

A process wishing to get data from a pipe issues a read( ) system call, specifying the file descriptor associated with the pipe's reading end. As described in Section 12.6.2, the kernel ends up invoking the read method found in the file operation table associated with the proper file object. In the case of a pipe, the entry for the read method in the read_pipe_fops table points to the pipe_read( ) function.

The pipe_read( ) function is quite involved, since the POSIX standard specifies several requirements for the pipe's read operations. Table 19-2 summarizes the expected behavior of a read( ) system call that requests n bytes from a pipe that has a pipe size (number of bytes in the pipe buffer yet to be read) equal to p. The system call might block the current process in two cases:

● The pipe buffer is empty when the system call starts.

● The pipe buffer does not include all requested bytes, and a writing process was previously put to sleep while waiting for space in the buffer.

Notice that the read operation can be nonblocking: in this case, it completes as soon as all available bytes (even none) are copied into the user address space. [3]

[3] Nonblocking operations are usually requested by specifying the O_NONBLOCK flag in the open( ) system call. This method does not work for pipes, since they cannot be opened. A process can, however, require a nonblocking operation on a pipe by issuing a fcntl( ) system call on the corresponding file descriptor.

Notice also that the value 0 is returned by the read( ) system call only if the pipe is empty and no process is currently using the file object associated with the pipe's write channel.

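Table 19-2. Reading n bytes from a pipe

| Pipe size p | At least one writing process: blocking read, sleeping writer | At least one writing process: blocking read, no sleeping writer | At least one writing process: nonblocking read | No writing process |
|---|---|---|---|---|
| p = 0 | Copy n bytes and return n, waiting for data when the pipe buffer is empty. | Wait for some data, copy it, and return its size. | Return -EAGAIN. | Return 0. |
| 0 < p < n | Copy n bytes and return n, waiting for data when the pipe buffer is empty. | Copy p bytes and return p: 0 bytes are left in the pipe buffer. | Copy p bytes and return p: 0 bytes are left in the pipe buffer. | Copy p bytes and return p: 0 bytes are left in the pipe buffer. |
| p ≥ n | Copy n bytes and return n: p-n bytes are left in the pipe buffer. | Copy n bytes and return n: p-n bytes are left in the pipe buffer. | Copy n bytes and return n: p-n bytes are left in the pipe buffer. | Copy n bytes and return n: p-n bytes are left in the pipe buffer. |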
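Three of the cases in Table 19-2 can be observed from a single process, as in the sketch below. The structure read_probe and the function probe_pipe_reads( ) are hypothetical names used only here. Note that where the table lists the kernel return value -EAGAIN, a User Mode program sees read( ) return -1 with errno set to EAGAIN.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical record of what read( ) returned in each probed case. */
struct read_probe {
    ssize_t empty_nonblock;     /* p = 0, writer exists, nonblocking read */
    int     empty_errno;        /* errno observed for that read */
    ssize_t partial;            /* 0 < p < n */
    ssize_t no_writer;          /* p = 0, no writing process */
};

/* Returns 0 after filling *r, or -1 if the pipe could not be set up. */
int probe_pipe_reads(struct read_probe *r)
{
    int fd[2];
    char buf[16];
    int flags;

    if (pipe(fd) == -1)
        return -1;

    /* A pipe cannot be opened, so O_NONBLOCK is set through fcntl( ). */
    flags = fcntl(fd[0], F_GETFL);
    if (flags == -1 || fcntl(fd[0], F_SETFL, flags | O_NONBLOCK) == -1 ||
        (errno = 0, r->empty_nonblock = read(fd[0], buf, sizeof buf),
         r->empty_errno = errno, write(fd[1], "abc", 3) != 3)) {
        close(fd[0]);
        close(fd[1]);
        return -1;
    }

    /* Three bytes are available but sixteen are requested. */
    r->partial = read(fd[0], buf, sizeof buf);

    /* Release the only write descriptor, then read the empty pipe. */
    close(fd[1]);
    r->no_writer = read(fd[0], buf, sizeof buf);

    close(fd[0]);
    return 0;
}
```

On a Linux system the three results should be -1 (with EAGAIN), 3, and 0, in that order.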

The function performs the following operations:

1. Acquires the i_sem semaphore of the inode.

2. Determines whether the pipe size, which is stored in the len field of the pipe_inode_info structure, is 0. In this case, determines whether the function must return or whether the process must be blocked while waiting until another process writes some data in the pipe (see Table 19-2). The type of I/O operation (blocking or nonblocking) is specified by the O_NONBLOCK flag in the f_flags field of the file object. If the current process must be blocked, the function performs the following actions:

   a. Adds 1 to the waiting_readers field of the pipe_inode_info structure.

   b. Adds current to the wait queue of the pipe (the wait field of the pipe_inode_info structure).

   c. Releases the inode semaphore.

   d. Sets the process status to TASK_INTERRUPTIBLE and invokes schedule( ).

   e. Once awake, removes current from the wait queue, acquires again the i_sem inode semaphore, decrements the waiting_readers field, and then jumps back to step 2.

3. Copies the requested number of bytes (or the number of available bytes, if the pipe holds fewer than requested) from the pipe's buffer to the user address space.

4. Updates the start and len fields of the pipe_inode_info structure.

5. Invokes wake_up_interruptible( ) to wake up all processes sleeping on the pipe's wait queue.

6. If not all requested bytes have been copied, there is at least one writing process currently sleeping (waiting_writers field greater than 0), and the read operation is blocking, the function jumps back to step 2.

7. Releases the i_sem semaphore of the inode.

8. Returns the number of bytes copied into the user address space.

19.1.5 Writing into a Pipe

A process wishing to put data into a pipe issues a write( ) system call, specifying the file descriptor for the writing end of the pipe. The kernel satisfies this request by invoking the write method of the proper file object; the corresponding entry in the write_pipe_fops table points to the pipe_write( ) function.

Table 19-3 summarizes the behavior, specified by the POSIX standard, of a write( ) system call that requests to write n bytes into a pipe having u unused bytes in its buffer. In particular, the standard requires that write operations involving a small number of bytes must be atomically executed. More precisely, if two or more processes are concurrently writing into a pipe, each write operation involving no more than 4,096 bytes (the pipe buffer size) must finish without being interleaved with write operations of other processes to the same pipe. However, write operations involving more than 4,096 bytes may be nonatomic and may also force the calling process to sleep.

Table 19-3. Writing n bytes to a pipe

| Available buffer space u | At least one reading process: blocking write | At least one reading process: nonblocking write | No reading process |
|---|---|---|---|
| u < n ≤ 4096 | Wait until n-u bytes are freed, copy n bytes, and return n. | Return -EAGAIN. | Send SIGPIPE signal and return -EPIPE. |
| n > 4096 | Copy n bytes (waiting when necessary) and return n. | If u > 0, copy u bytes and return u; else return -EAGAIN. | Send SIGPIPE signal and return -EPIPE. |
| u ≥ n | Copy n bytes and return n. | Copy n bytes and return n. | Send SIGPIPE signal and return -EPIPE. |

Moreover, each write operation to a pipe must fail if the pipe does not have a reading process (that is, if the readers field of the pipe's inode object has the value 0). In this case, the kernel sends a SIGPIPE signal to the writing process and terminates the write( ) system call with the -EPIPE error code, which usually leads to the familiar "Broken pipe" message.

The pipe_write( ) function performs the following operations:

1. Acquires the i_sem semaphore of the inode.

2. Checks whether the pipe has at least one reading process. If not, it sends a SIGPIPE signal to the current process, releases the inode semaphore, and returns an -EPIPE value.

3. Checks whether the number of bytes to be written is within the pipe's buffer size:

   a. If so, the write operation must be atomic. Therefore, checks whether the buffer has enough free space to store all bytes to be written.

   b. Otherwise, if the number of bytes is greater than the buffer size, the operation can start as long as there is any free space at all. Therefore, the function checks for at least one free byte.

4. If the buffer does not have enough free space and the write operation is nonblocking, releases the inode semaphore and returns the -EAGAIN error code.

5. If the buffer does not have enough free space and the write operation is blocking, performs the following actions:

   a. Adds 1 to the waiting_writers field of the pipe_inode_info structure.

   b. Adds current to the wait queue of the pipe (the wait field of the pipe_inode_info structure).

   c. Releases the inode semaphore.

   d. Sets the process status to TASK_INTERRUPTIBLE and invokes schedule( ).

   e. Once awake, removes current from the wait queue, again acquires the inode semaphore, decrements the waiting_writers field, and then jumps back to step 5.

6. Now the pipe buffer has enough free space to either copy the requested number of bytes (if the write operation must be atomic) or copy at least one byte; notice that other writers cannot steal free space because this writer owns the inode semaphore. Copies the requested number of bytes (or the number of free bytes if the pipe size is too small) from the user address space to the pipe's buffer.

7. Wakes up all processes sleeping on the pipe's wait queue.

8. If the write operation was blocking and not all requested bytes were written in the pipe buffer, jumps back to step 5. Notice that this case may occur only when the write operation is nonatomic; hence the current process remains blocked until one or more bytes of the pipe buffer are freed.

9. Releases the inode semaphore.

10. Returns the number of bytes written into the pipe's buffer.
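The two error paths of the write operation (steps 2 and 4) can be triggered from User Mode as sketched below. Both function names are hypothetical. The sketch ignores SIGPIPE temporarily so that the calling process survives to inspect errno, and it makes no assumption about the total capacity of the pipe, which on current Linux kernels is larger than the single page frame described in this chapter.

```c
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <unistd.h>

/* Hypothetical probe: writes to a pipe whose read end is already closed
 * and returns the resulting errno (0 if the write succeeded, -1 on setup
 * failure). */
int write_without_reader(void)
{
    int fd[2];
    int err;
    void (*old)(int);

    if (pipe(fd) == -1)
        return -1;
    old = signal(SIGPIPE, SIG_IGN);  /* otherwise SIGPIPE would kill us */
    close(fd[0]);                    /* no reading process is left */
    errno = 0;
    err = (write(fd[1], "x", 1) == -1) ? errno : 0;
    close(fd[1]);
    signal(SIGPIPE, old);
    return err;
}

/* Hypothetical probe: fills a pipe with nonblocking writes until one fails.
 * Stores the number of bytes accepted in *capacity and returns the errno of
 * the failing write (-1 on setup failure). */
int fill_until_full(long *capacity)
{
    int fd[2];
    char chunk[4096] = {0};
    ssize_t n;
    int flags, err;

    if (pipe(fd) == -1)
        return -1;
    flags = fcntl(fd[1], F_GETFL);
    if (flags == -1 || fcntl(fd[1], F_SETFL, flags | O_NONBLOCK) == -1) {
        close(fd[0]);
        close(fd[1]);
        return -1;
    }

    *capacity = 0;
    while ((n = write(fd[1], chunk, sizeof chunk)) > 0)
        *capacity += n;
    err = errno;                     /* set by the write that returned -1 */

    close(fd[0]);
    close(fd[1]);
    return err;
}
```

The first probe should report EPIPE and the second EAGAIN, the User Mode counterparts of the -EPIPE and -EAGAIN return values listed in Table 19-3.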


19.2 FIFOs

Although pipes are a simple, flexible, and efficient communication mechanism, they have one main drawback: there is no way to open an already existing pipe. This makes it impossible for two arbitrary processes to share the same pipe, unless the pipe was created by a common ancestor process.

This drawback is substantial for many application programs. Consider, for instance, a database engine server, which continuously polls client processes wishing to issue some queries and which sends the results of the database lookups back to them. Each interaction between the server and a given client might be handled by a pipe. However, client processes are usually created on demand by a command shell when a user explicitly queries the database; server and client processes thus cannot easily share a pipe.

To address such limitations, Unix systems introduce a special file type called a named pipe or FIFO (which stands for "first in, first out"; the first byte written into the special file is also the first byte that is read). Any FIFO is much like a pipe: rather than owning disk blocks in the filesystems, an opened FIFO is associated with a kernel buffer that temporarily stores the data exchanged by two or more processes. Thanks to the disk inode, however, a FIFO can be accessed by any process, since the FIFO filename is included in the system's directory tree.

Thus, in our example, the communication between server and clients may be easily established by using FIFOs instead of pipes. The server creates, at startup, a FIFO used by client programs to make their requests. Each client program creates, before establishing the connection, another FIFO to which the server program can write the answer to the query and includes the FIFO's name in the initial request to the server.

In Linux 2.4, FIFOs and pipes are almost identical and use the same pipe_inode_info structures. As a matter of fact, the read and write file operation methods of a FIFO are implemented by the same pipe_read( ) and pipe_write( ) functions described earlier in Section 19.1.4 and Section 19.1.5. Actually, there are only two significant differences:

● FIFO inodes appear on the system directory tree rather than on the pipefs special filesystem.

● FIFOs are a bidirectional communication channel; that is, it is possible to open a FIFO in read/write mode.

To complete our description, therefore, we just have to explain how FIFOs are created and opened.

19.2.1 Creating and Opening a FIFO A p ro ce s s cre a t e s a FIFO b y is s u in g a mknod( )[4] s ys t e m ca ll ( s e e S e ct io n 1 3 . 2 ) , p a s s in g t o it a s p a ra m e t e rs t h e p a t h n a m e o f t h e n e w FIFO a n d t h e va lu e S_IFIFO ( 0x1000) lo g ica lly ORe d wit h t h e p e rm is s io n b it m a s k o f t h e n e w file . POS IX in t ro d u ce s a fu n ct io n n a m e d mkfifo( ) s p e cifica lly t o cre a t e a FIFO. Th is ca ll is im p le m e n t e d in Lin u x, a s in S ys t e m V Re le a s e 4 , a s a C lib ra ry fu n ct io n t h a t in vo ke s mknod( ).

[4] In fact, mknod( ) can be used to create nearly any kind of file, such as block and character device files, FIFOs, and even regular files (it cannot create directories or sockets, though).

Once created, a FIFO can be accessed through the usual open( ), read( ), write( ), and close( ) system calls, but the VFS handles it in a special way because the FIFO inode and file operations are customized and do not depend on the filesystems in which the FIFO is stored.

The POSIX standard specifies the behavior of the open( ) system call on FIFOs; the behavior depends essentially on the requested access type, the kind of I/O operation (blocking or nonblocking), and the presence of other processes accessing the FIFO.

A process may open a FIFO for reading, for writing, or for reading and writing. The file operations associated with the corresponding file object are set to special methods for these three cases.

When a process opens a FIFO, the VFS performs the same operations as it does for device files (see Section 13.2.3). The inode object associated with the opened FIFO is initialized by a filesystem-dependent read_inode superblock method; this method always checks whether the inode on disk represents a special file, and invokes, if necessary, the init_special_inode( ) function. In turn, this function sets the i_fop field of the inode object to the address of the def_fifo_fops table. Later, the kernel sets the file operation table of the file object to def_fifo_fops, and executes its open method, which is implemented by fifo_open( ).
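As a minimal sketch of the creation step described at the start of this section, the helper below calls mknod( ) with S_IFIFO ORed with the permission bits. The helper names make_fifo and is_fifo are hypothetical and exist only for this illustration; the sketch says nothing about how the kernel services the call.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

/* Create a FIFO with mknod( ): S_IFIFO selects the file type and is
 * ORed with the permission bits. The device number argument is not
 * meaningful for a FIFO, so 0 is passed.
 * Returns 0 on success, -1 on error (errno is set by mknod). */
int make_fifo(const char *path, mode_t perms)
{
    assert(path != NULL);
    return mknod(path, S_IFIFO | (perms & 0777), 0);
}

/* Return 1 if path names a FIFO, 0 if it names another kind of file,
 * and -1 if stat( ) fails. */
int is_fifo(const char *path)
{
    struct stat st;

    if (stat(path, &st) < 0)
        return -1;
    return S_ISFIFO(st.st_mode) ? 1 : 0;
}
```

A caller would typically invoke make_fifo("/tmp/myfifo", 0600) once and then open the pathname from two cooperating processes; the pathname here is only an example.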

The fifo_open( ) function initializes the data structures specific to the FIFO; in particular, it performs the following operations:

1. Acquires the i_sem inode semaphore.

2. Checks the i_pipe field of the inode object; if it is NULL, it allocates and initializes a new pipe_inode_info structure, as in Step 1 of the earlier Section 19.1.3.

3. Depending on the access mode specified as the parameter of the open( ) system call, it initializes the f_op field of the file object with the address of the proper file operation table (see Table 19-4).

Table 19-4. FIFO's file operations

Access type    File operations    read method      write method
Read-only      read_fifo_fops     pipe_read( )     bad_pipe_w( )
Write-only     write_fifo_fops    bad_pipe_r( )    pipe_write( )
Read/write     rdwr_fifo_fops     pipe_read( )     pipe_write( )

4. If the access mode is either read-only or read/write, it adds one to the readers and r_counter fields of the pipe_inode_info structure. Moreover, if the access mode is read-only and there is no other reading process, it wakes up any writing process sleeping in the wait queue.

5. If the access mode is either write-only or read/write, it adds one to the writers and w_counter fields of the pipe_inode_info structure. Moreover, if the access mode is write-only and there is no other writing process, it wakes up any reading process sleeping in the wait queue.

6. If there are no readers or no writers, it decides whether the function should block or terminate returning an error code (see Table 19-5).

Table 19-5. Behavior of the fifo_open( ) function

Access type                 Blocking               Nonblocking
Read-only, with writers     Successfully return    Successfully return
Read-only, no writer        Wait for a writer      Successfully return
Write-only, with readers    Successfully return    Successfully return
Write-only, no reader       Wait for a reader      Return -ENXIO
Read/write                  Successfully return    Successfully return

7. Releases the inode semaphore, and terminates, returning 0 (success).

The FIFO's three specialized file operation tables differ mainly in the implementation of the read and write methods. If the access type allows read operations, the read method is implemented by the pipe_read( ) function. Otherwise, it is implemented by bad_pipe_r( ), which just returns an error code. Similarly, if the access type allows write operations, the write method is implemented by the pipe_write( ) function; otherwise, it is implemented by bad_pipe_w( ), which also returns an error code.
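The nonblocking column of Table 19-5 can be observed from User Mode. The following is a small sketch, not kernel code: open_fifo is a hypothetical helper that reports the errno value of a failed open( ), so the write-only, no-reader case shows up as ENXIO.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

/* Try to open a FIFO with the given flags.
 * On success, store the descriptor in *fd_out and return 0.
 * On failure, store -1 in *fd_out and return the errno value. */
int open_fifo(const char *path, int flags, int *fd_out)
{
    assert(path != NULL && fd_out != NULL);
    *fd_out = open(path, flags);
    return (*fd_out < 0) ? errno : 0;
}
```

With a freshly created FIFO, opening it with O_WRONLY | O_NONBLOCK should fail with ENXIO because no reader exists; opening it with O_RDONLY | O_NONBLOCK should succeed at once, and a later nonblocking write-only open should then succeed as well because a reader is present.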


19.3 System V IPC

IPC is an abbreviation for Interprocess Communication, and commonly refers to a set of mechanisms that allow a User Mode process to do the following:

● Synchronize itself with other processes by means of semaphores

● Send messages to other processes or receive messages from them

● Share a memory area with other processes

System V IPC first appeared in a development Unix variant called "Columbus Unix" and later was adopted by AT&T's System III. It is now found in most Unix systems, including Linux.

IPC data structures are created dynamically when a process requests an IPC resource (a semaphore, a message queue, or a shared memory region). An IPC resource is persistent: unless explicitly removed by a process, it is kept in memory and remains available until the system is shut down. An IPC resource may be used by any process, including those that do not share the ancestor that created the resource.

Since a process may require several IPC resources of the same type, each new resource is identified by a 32-bit IPC key, which is similar to the file pathname in the system's directory tree. Each IPC resource also has a 32-bit IPC identifier, which is somewhat similar to the file descriptor associated with an open file. IPC identifiers are assigned to IPC resources by the kernel and are unique within the system, while IPC keys can be freely chosen by programmers.

When two or more processes wish to communicate through an IPC resource, they all refer to the IPC identifier of the resource.

19.3.1 Using an IPC Resource

IPC resources are created by invoking the semget( ), msgget( ), or shmget( ) functions, depending on whether the new resource is a semaphore, a message queue, or a shared memory region.

The main objective of each of these three functions is to derive from the IPC key (passed as the first parameter) the corresponding IPC identifier, which is then used by the process for accessing the resource. If there is no IPC resource already associated with the IPC key, a new resource is created. If everything goes right, the function returns a positive IPC identifier; otherwise, it returns one of the error codes listed in Table 19-6.

Table 19-6. Error codes returned while requesting an IPC identifier

Error code    Description
EACCES        Process does not have proper access rights.
EEXIST        Process tried to create an IPC resource with the same key as one that already exists.
EIDRM         Resource is marked to be deleted.
ENOENT        No IPC resource with the requested key exists and the process did not ask to create it.
ENOMEM        No more storage is left for an additional IPC resource.
ENOSPC        Maximum limit on the number of IPC resources has been exceeded.

Assume that two independent processes want to share a common IPC resource. This can be achieved in two possible ways:

● The processes agree on some fixed, predefined IPC key. This is the simplest case, and it works quite well for any complex application implemented by many processes. However, there's a chance that the same IPC key is chosen by another unrelated program. In this case, the IPC functions might be successfully invoked and still return the IPC identifier of the wrong resource.[5]

[5] The ftok( ) function attempts to create a new key from a file pathname and an 8-bit project identifier passed as parameters. It does not guarantee, however, a unique key number, since there is a small chance that it will return the same IPC key to two different applications using different pathnames and project identifiers.

● One process issues a semget( ), msgget( ), or shmget( ) function by specifying IPC_PRIVATE as its IPC key. A new IPC resource is thus allocated, and the process can either communicate its IPC identifier to the other process in the application[6] or fork the other process itself. This method ensures that the IPC resource cannot be used accidentally by other applications.

[6] This implies, of course, the existence of another communication channel between the processes not based on IPC.

The last parameter of the semget( ), msgget( ), and shmget( ) functions can include two flags. IPC_CREAT specifies that the IPC resource must be created, if it does not already exist; IPC_EXCL specifies that the function must fail if the resource already exists and the IPC_CREAT flag is set.

Even if the process uses the IPC_CREAT and IPC_EXCL flags, there is no way to ensure exclusive access to an IPC resource, since other processes may always refer to the resource by using its IPC identifier.

To minimize the risk of incorrectly referencing the wrong resource, the kernel does not recycle IPC identifiers as soon as they become free. Instead, the IPC identifier assigned to a resource is almost always larger than the identifier assigned to the previously allocated resource of the same type. (The only exception occurs when the 32-bit IPC identifier overflows.) Each IPC identifier is computed by combining a slot usage sequence number relative to the resource type, an arbitrary slot index for the allocated resource, and an arbitrary value chosen in the kernel that is greater than the maximum number of allocatable resources. If we choose s to represent the slot usage sequence number, M to represent the upper bound on the number of allocatable resources, and i to represent the slot index, where 0 ≤ i < M, each IPC resource's ID is computed as follows:

IPC identifier = s × M + i

In Linux 2.4, the value of M is set to 32,768 (IPCMNI macro). The slot usage sequence number s is initialized to 0 and is incremented by 1 at every resource allocation. When s reaches a predefined threshold, which depends on the type of IPC resource, it restarts from 0.

Every type of IPC resource (semaphores, message queues, and shared memory areas) owns an ipc_ids data structure, which includes the fields shown in Table 19-7.

Table 19-7. The fields of the ipc_ids data structure

Type                Field      Description
int                 size       Current maximum number of IPC resources
int                 in_use     Number of allocated IPC resources
int                 max_id     Maximum slot index in use
unsigned short      seq        Slot usage sequence number for the next allocation
unsigned short      seq_max    Maximum slot usage sequence number
struct semaphore    sem        Semaphore protecting the ipc_ids data structure
spinlock_t          ary        Spin lock protecting the IPC resource descriptors
struct ipc_id *     entries    Array of IPC resource descriptors
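The identifier formula given earlier in this section, IPC identifier = s × M + i, is plain integer arithmetic, so the slot index and the sequence number can be recovered from an identifier by a remainder and a division. The sketch below only illustrates that arithmetic; the function names are hypothetical and are not meant to match any kernel helper.

```c
#include <assert.h>

/* M: upper bound on the number of allocatable resources
 * (the Linux 2.4 value quoted in the text). */
#define IPCMNI 32768

/* identifier = s * M + i, with 0 <= i < M */
int ipc_make_id(int s, int i)
{
    assert(s >= 0 && i >= 0 && i < IPCMNI);
    return s * IPCMNI + i;
}

/* Slot index i encoded in an identifier. */
int ipc_slot_index(int id)
{
    return id % IPCMNI;
}

/* Slot usage sequence number s encoded in an identifier. */
int ipc_slot_seq(int id)
{
    return id / IPCMNI;
}
```

For example, a resource placed in slot 5 while s is 0 gets identifier 5. If that resource is removed and slot 5 is reused when s is 1, the new identifier is 32,773, so a process still holding the stale value 5 refers to a sequence number that no longer matches.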

The size field stores the maximum number of allocatable IPC resources of the given type. The system administrator may increase this value for any resource type by writing into the /proc/sys/kernel/sem, /proc/sys/kernel/msgmni, and /proc/sys/kernel/shmmni special files, respectively.

The entries field points to an array of pointers to kern_ipc_perm data structures, one for every allocatable resource (the size field is also the size of the array). Each kern_ipc_perm data structure is associated with an IPC resource and contains the fields shown in Table 19-8. The uid, gid, cuid, and cgid fields store the user and group identifiers of the resource's current owner and the user and group identifiers of the resource's creator, respectively. The mode bit mask includes six flags, which store the read and write access permissions for the resource's owner, the resource's group, and all other users. IPC access permissions are similar to file access permissions described in Section 1.5.5, except that the Execute permission flag is not used.

Table 19-8. The fields in the kern_ipc_perm structure

Type              Field    Description
int               key      IPC key
unsigned int      uid      Owner user ID
unsigned int      gid      Owner group ID
unsigned int      cuid     Creator user ID
unsigned int      cgid     Creator group ID
unsigned short    mode     Permission bit mask
unsigned long     seq      Slot usage sequence number

The kern_ipc_perm data structure also includes a key field (which contains the IPC key of the corresponding resource) and a seq field (which stores the slot usage sequence number s used to compute the IPC identifier of the resource).

The semctl( ), msgctl( ), and shmctl( ) functions may be used to handle IPC resources. The IPC_SET subcommand allows a process to change the owner's user and group identifiers and the permission bit mask in the ipc_perm data structure. The IPC_STAT and IPC_INFO subcommands retrieve some information concerning a resource. Finally, the IPC_RMID subcommand releases an IPC resource. Depending on the type of IPC resource, other specialized subcommands are also available.[7]

[7] Another IPC design flaw is that a User Mode process cannot atomically create and initialize an IPC semaphore, since these two operations are performed by two different IPC functions.

Once an IPC resource is created, a process may act on the resource by means of a few specialized functions. A process may acquire or release an IPC semaphore by issuing the semop( ) function. When a process wants to send or receive an IPC message, it uses the msgsnd( ) and msgrcv( ) functions, respectively. Finally, a process attaches and detaches an IPC shared memory region in its address space by means of the shmat( ) and shmdt( ) functions, respectively.

19.3.2 The ipc( ) System Call

All IPC functions must be implemented through suitable Linux system calls. Actually, in the 80x86 architecture, there is just one IPC system call named ipc( ). When a process invokes an IPC function, let's say msgget( ), it really invokes a wrapper function in the C library. This in turn invokes the ipc( ) system call by passing to it all the parameters of msgget( ) plus a proper subcommand code (in this case, MSGGET). The sys_ipc( ) service routine examines the subcommand code and invokes the kernel function that implements the requested service.

The ipc( ) "multiplexer" system call is a legacy from older Linux versions, which included the IPC code in a dynamic module (see Appendix B). It did not make much sense to reserve several system call entries in the system_call table for a kernel component that could be missing, so the kernel designers adopted the multiplexer approach.

Nowadays, System V IPC can no longer be compiled as a dynamic module, and there is no justification for using a single IPC system call. As a matter of fact, Linux provides one system call for each IPC function on Hewlett-Packard's Alpha architecture and on Intel's IA-64.
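The multiplexer idea can be sketched in a few lines: one entry point receives a subcommand code and forwards the remaining arguments to the function that implements the service. Everything below is a hypothetical User Mode model. The demo_ names, the numeric subcommand values, the fake return values, and the error returned for an unknown code are assumptions made for the illustration, not the kernel's definitions.

```c
#include <assert.h>
#include <errno.h>

/* Subcommand codes for this model only (values are assumptions). */
enum demo_ipc_call { DEMO_SEMGET = 2, DEMO_MSGGET = 13, DEMO_SHMGET = 23 };

/* Hypothetical stand-ins for the functions behind each service; each
 * returns a distinct fake identifier so the dispatch path is visible. */
static int demo_semget(int key, int nsems, int flags)
{
    (void)key; (void)nsems; (void)flags;
    return 100;
}

static int demo_msgget(int key, int flags)
{
    (void)key; (void)flags;
    return 200;
}

static int demo_shmget(int key, int size, int flags)
{
    (void)key; (void)size; (void)flags;
    return 300;
}

/* Single entry point: examine the subcommand code, then invoke the
 * function that implements the requested service. */
int demo_ipc(int call, int first, int second, int third)
{
    switch (call) {
    case DEMO_SEMGET:
        return demo_semget(first, second, third);
    case DEMO_MSGGET:
        return demo_msgget(first, second);
    case DEMO_SHMGET:
        return demo_shmget(first, second, third);
    default:
        return -EINVAL;   /* unknown subcommand in this model */
    }
}
```

A library wrapper in this model would implement its own msgget-like function as demo_ipc(DEMO_MSGGET, key, flags, 0), which is the shape of the indirection the text describes.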

19.3.3 IPC Semaphores

IPC semaphores are quite similar to the kernel semaphores introduced in Chapter 5; they are counters used to provide controlled access to shared data structures for multiple processes.

The semaphore value is positive if the protected resource is available, and 0 if the protected resource is currently not available. A process that wants to access the resource tries to decrement the semaphore value; the kernel, however, blocks the process until the operation on the semaphore yields a positive value. When a process relinquishes a protected resource, it increments its semaphore value; in doing so, any other process waiting for the semaphore is woken up.

Actually, IPC semaphores are more complicated to handle than kernel semaphores for two main reasons:

● Each IPC semaphore is a set of one or more semaphore values, not just a single value like a kernel semaphore. This means that the same IPC resource can protect several independent shared data structures. The number of semaphore values in each IPC semaphore must be specified as a parameter of the semget( ) function when the resource is being allocated. From now on, we'll refer to the counters inside an IPC semaphore as primitive semaphores. There are bounds both on the number of IPC semaphore resources (by default, 128) and on the number of primitive semaphores inside a single IPC semaphore resource (by default, 250); however, the system administrator can easily modify these bounds by writing into the /proc/sys/kernel/sem file.

● System V IPC semaphores provide a fail-safe mechanism for situations in which a process dies without being able to undo the operations that it previously issued on a semaphore. When a process chooses to use this mechanism, the resulting operations are called undoable semaphore operations. When the process dies, all of its IPC semaphores can revert to the values they would have had if the process had never started its operations. This can help prevent other processes that use the same semaphores from remaining blocked indefinitely as a consequence of the terminating process failing to manually undo its semaphore operations.

First, we'll briefly sketch the typical steps performed by a process wishing to access one or more resources protected by an IPC semaphore:

1. Invokes the semget( ) wrapper function to get the IPC semaphore identifier, specifying as the parameter the IPC key of the IPC semaphore that protects the shared resources. If the process wants to create a new IPC semaphore, it also specifies the IPC_CREAT or IPC_PRIVATE flag and the number of primitive semaphores required (see Section 19.3.1 earlier in this chapter).

2. Invokes the semop( ) wrapper function to test and decrement all primitive semaphore values involved. If all the tests succeed, the decrements are performed, the function terminates, and the process is allowed to access the protected resources. If some semaphores are in use, the process is usually suspended until some other process releases the resources. The function receives as parameters the IPC semaphore identifier, an array of integers specifying the operations to be atomically performed on the primitive semaphores, and the number of such operations. Optionally, the process may specify the SEM_UNDO flag, which instructs the kernel to reverse the operations, should the process exit without releasing the primitive semaphores.

3. When relinquishing the protected resources, invokes the semop( ) function again to atomically increment all primitive semaphores involved.

4. Optionally, invokes the semctl( ) wrapper function, specifying the IPC_RMID command to remove the IPC semaphore from the system.

Now we can discuss how the kernel implements IPC semaphores. The data structures involved are shown in Figure 19-1. The sem_ids variable stores the ipc_ids data structure of the IPC semaphore resource type; its entries field is an array of pointers to sem_array data structures, one item for every IPC semaphore resource.

Figure 19-1. IPC semaphore data structures

Formally, the array stores pointers to kern_ipc_perm data structures, but each structure is simply the first field of the sem_array data structure. All fields of the sem_array data structure are shown in Table 19-9.

Table 19-9. The fields in the sem_array data structure

Type                    Field               Description
struct kern_ipc_perm    sem_perm            kern_ipc_perm data structure
long                    sem_otime           Timestamp of last semop( )
long                    sem_ctime           Timestamp of last change
struct sem *            sem_base            Pointer to first sem structure
struct sem_queue *      sem_pending         Pending operations
struct sem_queue **     sem_pending_last    Last pending operation
struct sem_undo *       undo                Undo requests
unsigned short          sem_nsems           Number of semaphores in array

The sem_base field points to an array of struct sem data structures, one for every IPC primitive semaphore. The latter data structure includes only two fields:

semval
The value of the semaphore's counter.

sempid
The PID of the last process that accessed the semaphore. This value can be queried by a process through the semctl( ) wrapper function.
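The typical steps listed earlier in this section (semget( ), semop( ) to acquire, semop( ) to release, semctl( ) with IPC_RMID) can be sketched from User Mode as follows. This is a hedged illustration rather than a reference program: sem_roundtrip and union demo_semun are hypothetical names, the key is IPC_PRIVATE so no other application is touched, and the initial value is set in a separate semctl( ) call because creation and initialization are two distinct functions.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <unistd.h>

/* Argument type for semctl( ); on Linux the caller defines it. */
union demo_semun {
    int val;
    struct semid_ds *buf;
    unsigned short *array;
};

/* Create a private IPC semaphore holding one primitive semaphore set
 * to 1, acquire and release it with undoable operations, check that
 * sempid reports this process, and remove the resource.
 * Returns 0 on success or a negative errno value on failure. */
int sem_roundtrip(void)
{
    struct sembuf acquire = { .sem_num = 0, .sem_op = -1, .sem_flg = SEM_UNDO };
    struct sembuf release = { .sem_num = 0, .sem_op = +1, .sem_flg = SEM_UNDO };
    union demo_semun arg;
    int semid, err = 0;

    semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);      /* step 1 */
    if (semid < 0)
        return -errno;

    arg.val = 1;                        /* resource initially available */
    if (semctl(semid, 0, SETVAL, arg) < 0) {
        err = -errno;
        goto out;
    }
    if (semop(semid, &acquire, 1) < 0) {                    /* step 2 */
        err = -errno;
        goto out;
    }
    /* ... the protected resource would be used here ... */
    if (semop(semid, &release, 1) < 0) {                    /* step 3 */
        err = -errno;
        goto out;
    }
    if (semctl(semid, 0, GETPID) != getpid())   /* sempid of semaphore 0 */
        err = -EINVAL;
out:
    semctl(semid, 0, IPC_RMID);                             /* step 4 */
    return err;
}
```

Because both operations carry SEM_UNDO, a process that died between the acquire and the release would have its decrement reversed by the kernel, which is the subject of the next section.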

19.3.3.1 Undoable semaphore operations

If a process aborts suddenly, it cannot undo the operations that it started (for instance, release the semaphores it reserved); so by declaring them undoable, the process lets the kernel return the semaphores to a consistent state and allow other processes to proceed. Processes can request undoable operations by specifying the SEM_UNDO flag in the semop( ) function.

Information to help the kernel reverse the undoable operations performed by a given process on a given IPC semaphore resource is stored in a sem_undo data structure. It essentially contains the IPC identifier of the semaphore and an array of integers representing the changes to the primitive semaphores' values caused by all undoable operations performed by the process.

A simple example can illustrate how such sem_undo elements are used. Consider a process that uses an IPC semaphore resource containing four primitive semaphores. Suppose that it invokes the semop( ) function to increment the first counter by 1 and decrement the second by 2. If it specifies the SEM_UNDO flag, the integer in the first array element in the sem_undo data structure is decremented by 1, the integer in the second element is incremented by 2, and the other two integers are left unchanged. Further undoable operations on the IPC semaphore performed by the same process change the integers stored in the sem_undo structure accordingly. When the process exits, any nonzero value in that array corresponds to one or more unbalanced operations on the corresponding primitive semaphore; the kernel reverses these operations, simply adding the nonzero value to the corresponding semaphore's counter. In other words, the changes made by the aborted process are backed out while the changes made by other processes are still reflected in the state of the semaphores.

For each process, the kernel keeps track of all semaphore resources handled with undoable operations so that it can roll them back if the process unexpectedly exits. Furthermore, for each semaphore, the kernel has to keep track of all its sem_undo structures so it can quickly access them whenever a process uses semctl( ) to force an explicit value into a primitive semaphore's counter or to destroy an IPC semaphore resource.

The kernel is able to handle these tasks efficiently, thanks to two lists, which we denote as the per-process and the per-semaphore lists. The first list keeps track of all semaphores operated upon by a given process with undoable operations. The second list keeps track of all processes that are acting on a given semaphore with undoable operations. More precisely:

● The per-process list includes all sem_undo data structures corresponding to IPC semaphores on which the process has performed undoable operations. The semundo field of the process descriptor points to the first element of the list, while the proc_next field of each sem_undo data structure points to the next element in the list.

● The per-semaphore list includes all sem_undo data structures corresponding to the processes that performed undoable operations on the semaphore. The undo field of the sem_array data structure points to the first element of the list, while the id_next field of each sem_undo data structure points to the next element in the list.

The per-process list is used when a process terminates. The sem_exit( ) function, which is invoked by do_exit( ), walks through the list and reverses the effect of any unbalanced operation for every IPC semaphore touched by the process. By contrast, the per-semaphore list is mainly used when a process invokes the semctl( ) function to force an explicit value into a primitive semaphore. The kernel sets the corresponding element to 0 in the arrays of all sem_undo data structures referring to that IPC semaphore resource, since it would no longer make any sense to reverse the effect of previous undoable operations performed on that primitive semaphore. Moreover, the per-semaphore list is also used when an IPC semaphore is destroyed; all related sem_undo data structures are invalidated by setting the semid field to -1.[8]

[8] Notice that they are just invalidated and not freed, since it would be too costly to remove the data structures from the per-process lists of all processes.

19.3.3.2 The queue of pending requests

The kernel associates a queue of pending requests with each IPC semaphore to identify processes that are waiting on one (or more) of the semaphores in the array. The queue is a doubly linked list of sem_queue data structures whose fields are shown in Table 19-10. The first and last pending requests in the queue are referenced, respectively, by the sem_pending and sem_pending_last fields of the sem_array structure. This last field allows the list to be handled easily as a FIFO; new pending requests are added to the end of the list so they will be serviced later. The most important fields of a pending request are nsops (which stores the number of primitive semaphores involved in the pending operation) and sops (which points to an array of integer values describing each semaphore operation). The sleeper field stores the descriptor address of the sleeping process that requested the operation.

Table 19-10. The fields in the sem_queue data structure

Type                    Field     Description
struct sem_queue *      next      Pointer to next queue element
struct sem_queue **     prev      Pointer to previous queue element
struct task_struct *    sleeper   Pointer to the sleeping process that requested the semaphore operation
struct sem_undo *       undo      Pointer to sem_undo structure
int                     pid       Process identifier
int                     status    Completion status of operation
struct sem_array *      sma       Pointer to IPC semaphore descriptor
int                     id        Slot index of the IPC semaphore resource
struct sembuf *         sops      Pointer to array of pending operations
int                     nsops     Number of pending operations
int                     alter     Flag indicating that the operation sets the semaphore value

Figure 19-1 illustrates an IPC semaphore that has three pending requests. Two of them refer to undoable operations, so the undo field of the sem_queue data structure points to the corresponding sem_undo structure; the third pending request has a NULL undo field since the corresponding operation is not undoable.
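Table 19-10 declares prev as a pointer to a pointer, which suggests a list in which each element records the address of the pointer that refers to it. The following user-space sketch shows how such a FIFO could be maintained. It assumes that the "last" reference is likewise the address of the final next pointer; the type and function names are hypothetical, not the kernel's.

```c
#include <assert.h>
#include <stddef.h>

/* Toy versions of sem_queue and sem_array, keeping only the link fields. */
struct req {
    struct req *next;    /* next pending request, NULL at the tail */
    struct req **prev;   /* address of the pointer that points to this node */
    int pid;
};

struct sem_model {
    struct req *pending;        /* first pending request */
    struct req **pending_last;  /* address of the last "next" pointer */
};

static void queue_init(struct sem_model *s)
{
    s->pending = NULL;
    s->pending_last = &s->pending;
}

/* Append at the tail so that requests are served in arrival order. */
static void queue_append(struct sem_model *s, struct req *q)
{
    q->prev = s->pending_last;
    *s->pending_last = q;
    q->next = NULL;
    s->pending_last = &q->next;
}

/* Unlink a request from any position without walking the list. */
static void queue_remove(struct sem_model *s, struct req *q)
{
    *q->prev = q->next;
    if (q->next != NULL)
        q->next->prev = q->prev;
    else
        s->pending_last = q->prev;   /* q was the tail */
}
```

With this layout, neither appending nor removing needs a special case for the head of the list, because the head pointer is treated like any other next pointer.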

19.3.4 IPC Messages

Processes can communicate with one another by means of IPC messages. Each message generated by a process is sent to an IPC message queue, where it stays until another process reads it. A message is composed of a fixed-size header and a variable-length text; it can be labeled with an integer value (the message type), which allows a process to selectively retrieve messages from its message queue. [9] Once a process has read a message from an IPC message queue, the kernel destroys the message; therefore, only one process can receive a given message.

[9] As we'll see, the message queue is implemented by means of a linked list. Since messages can be retrieved in an order different from "first in, first out," the name "message queue" is not appropriate. However, new messages are always put at the end of the linked list.

To send a message, a process invokes the msgsnd( ) function, passing the following as parameters:

● The IPC identifier of the destination message queue
● The size of the message text
● The address of a User Mode buffer that contains the message type immediately followed by the message text

To retrieve a message, a process invokes the msgrcv( ) function, passing to it:

● The IPC identifier of the IPC message queue resource
● The pointer to a User Mode buffer to which the message type and message text should be copied
● The size of this buffer
● A value t that specifies what message should be retrieved

If the value t is 0, the first message in the queue is returned. If t is positive, the first message in the queue with its type equal to t is returned. Finally, if t is negative, the function returns the first message whose message type is the lowest value less than or equal to the absolute value of t.

To avoid resource exhaustion, there are some limits on the number of IPC message queue resources allowed (by default, 16), on the size of each message (by default, 8,192 bytes), and on the maximum total size of the messages in a queue (by default, 16,384 bytes). As usual, however, the system administrator can tune these values by writing into the /proc/sys/kernel/msgmni, /proc/sys/kernel/msgmax, and /proc/sys/kernel/msgmnb files, respectively.

The data structures associated with IPC message queues are shown in Figure 19-2. The msg_ids variable stores the ipc_ids data structure of the IPC message queue resource type; its entries field is an array of pointers to msg_queue data structures, one item for every IPC message queue resource. Formally, the array stores pointers to kern_ipc_perm data structures, but each such structure is simply the first field of the msg_queue data structure. All fields of the msg_queue data structure are shown in Table 19-11.

Figure 19-2. IPC message queue data structures

Table 19-11. The msg_queue data structure

Type                    Field         Description
struct kern_ipc_perm    q_perm        kern_ipc_perm data structure
long                    q_stime       Time of last msgsnd( )
long                    q_rtime       Time of last msgrcv( )
long                    q_ctime       Last change time
unsigned long           q_cbytes      Number of bytes in queue
unsigned long           q_qnum        Number of messages in queue
unsigned long           q_qbytes      Maximum number of bytes in queue
int                     q_lspid       PID of last msgsnd( )
int                     q_lrpid       PID of last msgrcv( )
struct list_head        q_messages    List of messages in queue
struct list_head        q_receivers   List of processes receiving messages
struct list_head        q_senders     List of processes sending messages

The most important field is q_messages, which represents the head (i.e., the first dummy element) of a doubly linked circular list containing all messages currently in the queue.

Each message is broken into one or more pages, which are dynamically allocated. The beginning of the first page stores the message header, which is a data structure of type msg_msg; its fields are listed in Table 19-12. The m_list field stores the pointers to the previous and next messages in the queue. The message text starts right after the msg_msg descriptor; if the message is longer than 4,072 bytes (the page size minus the size of the msg_msg descriptor), it continues on another page, whose address is stored in the next field of the msg_msg descriptor. The second page frame starts with a descriptor of type msg_msgseg, which just includes a next pointer storing the address of an optional third page, and so on.
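The page layout just described implies a simple calculation of how many pages a message occupies. The sketch below takes the 4,072-byte figure from the text; the capacity of the follow-on pages (4,092 bytes, i.e., a 4,096-byte page minus one 4-byte next pointer) is an assumption made for illustration, as are the macro and function names.

```c
#include <assert.h>
#include <stddef.h>

#define FIRST_PAGE_TEXT 4072u  /* text bytes in the first page, per the text */
#define NEXT_PAGE_TEXT  4092u  /* assumed: 4,096 minus one 4-byte next pointer */

/* Number of pages needed to hold a message text of len bytes. */
static size_t msg_pages(size_t len)
{
    size_t pages = 1;                 /* the first page always holds msg_msg */
    if (len > FIRST_PAGE_TEXT) {
        size_t rest = len - FIRST_PAGE_TEXT;
        pages += (rest + NEXT_PAGE_TEXT - 1) / NEXT_PAGE_TEXT;  /* round up */
    }
    return pages;
}
```

Under these assumptions, a message of the default maximum size (8,192 bytes) leaves 4,120 bytes after the first page, which is slightly more than one follow-on page can hold, so it occupies three pages.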

Table 19-12. The msg_msg data structure

Type                    Field     Description
struct list_head        m_list    Pointers for message list
long                    m_type    Message type
int                     m_ts      Message text size
struct msg_msgseg *     next      Next portion of the message

When the message queue is full (either the maximum number of messages or the maximum total size has been reached), processes that try to enqueue new messages may be blocked. The q_senders field of the msg_queue data structure is the head of a list that includes the pointers to the descriptors of all blocked sending processes.

Even receiving processes may be blocked when the message queue is empty (or the process specified a type of message not present in the queue). The q_receivers field of the msg_queue data structure is the head of a list of msg_receiver data structures, one for every blocked receiving process. Each of these structures essentially includes a pointer to the process descriptor, a pointer to the msg_msg structure of the message, and the type of the requested message.
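Whether a receiver finds a suitable message or blocks depends on the selection rule applied to the value t passed to msgrcv( ). As a rough illustration, the rule can be modeled over a plain array of message types kept in queue order. The function name and the array representation are hypothetical; the kernel works on the q_messages list instead.

```c
#include <assert.h>
#include <stddef.h>

/* Return the index of the message that a request with selector t would
 * retrieve from types[0..n-1] (queue order), or -1 if none matches. */
static int select_message(const long *types, size_t n, long t)
{
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (t == 0)
            return (int)i;                    /* any message: take the first */
        if (t > 0) {
            if (types[i] == t)
                return (int)i;                /* first message of that type */
        } else if (types[i] <= -t &&
                   (best < 0 || types[i] < types[best])) {
            best = (int)i;                    /* lowest type so far; earliest wins ties */
        }
    }
    return best;
}
```

A return value of -1 corresponds to the case in which the receiving process would be blocked because no message of the requested type is present.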

19.3.5 IPC Shared Memory

The most useful IPC mechanism is shared memory, which allows two or more processes to access some common data structures by placing them in an IPC shared memory region. Each process that wants to access the data structures included in an IPC shared memory region must add to its address space a new memory region (see Section 8.3), which maps the page frames associated with the IPC shared memory region. Such page frames can then be easily handled by the kernel through demand paging (see Section 8.4.3).

As with semaphores and message queues, the shmget( ) function is invoked to get the IPC identifier of a shared memory region, optionally creating it if it does not already exist.

The shmat( ) function is invoked to "attach" an IPC shared memory region to a process. It receives as its parameter the identifier of the IPC shared memory resource and tries to add a shared memory region to the address space of the calling process. The calling process can request a specific starting linear address for the memory region, but the address is usually unimportant, and each process accessing the shared memory region can use a different address in its own address space. The process's Page Tables are left unchanged by shmat( ). We describe later what the kernel does when the process tries to access a page that belongs to the new memory region.

The shmdt( ) function is invoked to "detach" an IPC shared memory region, specified by the linear address at which it was attached; that is, to remove the corresponding memory region from the process's address space. Recall that an IPC shared memory resource is persistent: even if no process is using it, the corresponding pages cannot be discarded, although they can be swapped out.

As for the other types of IPC resources, in order to avoid overuse of memory by User Mode processes, there are some limits on the allowed number of IPC shared memory regions (by default, 4,096), on the size of each segment (by default, 32 megabytes), and on the maximum total size of all segments (by default, 8 gigabytes). As usual, however, the system administrator can tune these values by writing into the /proc/sys/kernel/shmmni, /proc/sys/kernel/shmmax, and /proc/sys/kernel/shmall files, respectively.

Figure 19-3. IPC shared memory data structures
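From User Mode, the shmget( ), shmat( ), and shmdt( ) functions are typically used together as in the following minimal sketch. It attaches the same private segment twice within one process to show that two different linear addresses can refer to the same page frames. The helper name shm_roundtrip, the 4,096-byte size, and the 0600 permissions are illustrative choices, and the helper simply reports -1 on systems where System V shared memory is unavailable.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Create a private segment, attach it twice, and check that a value written
 * through one mapping is visible through the other.
 * Returns the value read back, or -1 if any System V call fails. */
static int shm_roundtrip(void)
{
    int result = -1;
    int id = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);
    if (id == -1)
        return -1;

    int *a = shmat(id, NULL, 0);   /* let the kernel choose the address */
    int *b = shmat(id, NULL, 0);   /* second attach of the same segment */
    if (a != (void *)-1 && b != (void *)-1) {
        *a = 42;
        result = *b;
    }
    if (a != (void *)-1)
        shmdt(a);                  /* detach by address, not by identifier */
    if (b != (void *)-1)
        shmdt(b);
    shmctl(id, IPC_RMID, NULL);    /* the resource is persistent: remove it */
    return result;
}
```

The final shmctl( ) call matters because the resource would otherwise survive the process, consistent with the persistence described above.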

The data structures associated with IPC shared memory regions are shown in Figure 19-3. The shm_ids variable stores the ipc_ids data structure of the IPC shared memory resource type; its entries field is an array of pointers to shmid_kernel data structures, one item for every IPC shared memory resource. Formally, the array stores pointers to kern_ipc_perm data structures, but each such structure is simply the first field of the shmid_kernel data structure. All fields of the shmid_kernel data structure are shown in Table 19-13.

Table 19-13. The fields in the shmid_kernel data structure

Type                    Field        Description
struct kern_ipc_perm    shm_perm     kern_ipc_perm data structure
struct file *           shm_file     Special file of the segment
int                     id           Slot index of the segment
unsigned long           shm_nattch   Number of current attaches
unsigned long           shm_segsz    Segment size in bytes
long                    shm_atime    Last access time
long                    shm_dtime    Last detach time
long                    shm_ctime    Last change time
int                     shm_cprid    PID of creator
int                     shm_lprid    PID of last accessing process

The most important field is shm_file, which stores the address of a file object. This reflects the tight integration of IPC shared memory with the VFS layer in Linux 2.4. In particular, each IPC shared memory region is associated with a regular file belonging to the shm special filesystem (see Section 12.3.1).

Since the shm filesystem has no mount point in the system directory tree, no user can open and access its files by means of regular VFS system calls. However, whenever a process "attaches" a segment, the kernel invokes do_mmap( ) and creates a new shared memory mapping of the file in the address space of the process. Therefore, files that belong to the shm special filesystem have just one file object method, mmap, which is implemented by the shm_mmap( ) function.

As shown in Figure 19-3, a memory region that corresponds to an IPC shared memory region is described by a vm_area_struct object (see Section 15.2); its vm_file field points back to the file object of the special file, which in turn references a dentry object and an inode object. The inode number, stored in the i_ino field of the inode, is actually the slot index of the IPC shared memory region, so the inode object indirectly references the shmid_kernel descriptor.

As usual, for any shared memory mapping, page frames that belong to the IPC shared memory region are included in the page cache through an address_space object referenced by the i_mapping field of the inode (you might also refer to Figure 15-4).

19.3.5.1 Swapping out pages of IPC shared memory regions

The kernel has to be careful when swapping out pages included in shared memory regions, and the role of the swap cache is crucial (this topic was already discussed in Section 16.3). As explained in Section 16.5.1, to swap out a page owned by an address_space object, the kernel essentially marks the page as dirty, thus triggering a data transfer to disk, and then removes the page from the process's Page Table. If the page belongs to a shared file memory mapping, eventually the page is no longer referenced by any process, and the shrink_cache( ) function will release it to the Buddy system (see Section 16.7.5). This is fine because the data in the page is just a duplicate of some data on disk.

However, pages of an IPC shared memory region map a special inode that has no image on disk. Moreover, an IPC shared memory region is persistent; that is, its pages must be preserved even when the segment is not attached to any process. Therefore, the kernel cannot simply discard the pages when reclaiming the corresponding page frames; rather, the pages have to be swapped out.

The try_to_swap_out( ) function includes no check for this special case, so a page belonging to the region is marked as dirty and removed from the process address space. Even the shrink_cache( ) function, which periodically prunes the page cache from the least recently used pages, has no check for this special case, so it ends up invoking the writepage method of the owner address_space object (see Section 16.7.5).

How, then, are IPC shared memory regions preserved when their pages are swapped out? Pages belonging to IPC shared memory regions implement the writepage method by means of a custom shmem_writepage( ) function, which essentially allocates a new page slot in a swap area and moves the page from the page cache to the swap cache (it's just a matter of changing the owner address_space object of the page). The function also stores the swapped-out page identifier in a shmem_inode_info structure embedded in the filesystem-specific portion of the inode object. Notice that the page is not immediately written onto the swap area: this is done when the shrink_cache( ) function is invoked again.
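The two bookkeeping effects attributed to shmem_writepage( ), changing the page's owner and recording the swapped-out page identifier, can be mimicked with a toy model. Every type and name below is hypothetical and heavily simplified; in particular, slot allocation is reduced to a number passed in by the caller, and 0 is assumed to mean "not swapped out."

```c
#include <assert.h>

#define SEG_PAGES 4

struct space_model { const char *name; };      /* stands in for address_space */

struct page_model {
    struct space_model *mapping;  /* current owner of the page */
    unsigned long index;          /* logical page number inside the segment */
};

struct shm_info_model {
    unsigned long swap_slot[SEG_PAGES];  /* 0 means "not swapped out" */
};

static struct space_model swap_cache = { "swap cache" };

/* Move a page from the page cache to the swap cache and remember its slot. */
static void writepage_model(struct page_model *pg, struct shm_info_model *info,
                            unsigned long new_slot)
{
    info->swap_slot[pg->index] = new_slot;  /* swapped-out page identifier */
    pg->mapping = &swap_cache;              /* only the owner changes */
}
```

No data is copied in this model, which reflects the remark that the page is not written to the swap area at this stage.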

19.3.5.2 Demand paging for IPC shared memory regions

The pages added to a process by shmat( ) are dummy pages; the function adds a new memory region into a process's address space, but it doesn't modify the process's Page Tables. Moreover, as we have seen, pages of an IPC shared memory region can be swapped out. Therefore, these pages are handled through the demand paging mechanism.

As we know, a Page Fault occurs when a process tries to access a location of an IPC shared memory region whose underlying page frame has not been assigned. The corresponding exception handler determines that the faulty address is inside the process address space and that the corresponding Page Table entry is null; therefore, it invokes the do_no_page( ) function (see Section 8.4.3). In turn, this function checks whether the nopage method for the memory region is defined. That method is invoked, and the Page Table entry is set to the address returned from it (see also Section 15.2.4).

Memory regions used for IPC shared memory always define the nopage method. It is implemented by the shmem_nopage( ) function, which performs the following operations:

1. Walks the chain of pointers in the VFS objects and derives the address of the inode object of the IPC shared memory resource (see Figure 19-3).
2. Computes the logical page number inside the segment from the vm_start field of the memory region descriptor and the requested address.
3. Checks whether the page is already included in the swap cache (see Section 16.3); if so, it terminates by returning its address.
4. Checks whether the shmem_inode_info embedded in the inode object stores a swapped-out page identifier for the logical page number. If so, it performs a swap-in operation by invoking swapin_readahead( ) (see Section 16.6.1), waits until the data transfer completes, and terminates by returning the address of the page.
5. Otherwise, the page is not stored in a swap area; therefore, the function allocates a new page from the Buddy system, inserts it into the page cache, and returns its address.

The do_no_page( ) function sets the entry that corresponds to the faulty address in the process's Page Tables so that it points to the page frame returned by the method.
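Steps 2 through 5 of the list above can be condensed into a short model. This is a sketch, not the kernel's code: it assumes 4 KB pages, assumes that the memory region maps the segment starting from its first page, and replaces the swap cache and shmem_inode_info lookups with two flags supplied by the caller. All names are hypothetical.

```c
#include <assert.h>

#define PAGE_SHIFT 12   /* assumed: 4 KB pages */

enum page_source { FROM_SWAP_CACHE, FROM_SWAP_AREA, NEWLY_ALLOCATED };

/* Step 2: logical page number of a faulting address inside the segment.
 * The region is assumed to map the segment from its first page. */
static unsigned long logical_page(unsigned long vm_start, unsigned long address)
{
    return (address - vm_start) >> PAGE_SHIFT;
}

/* Steps 3-5: decide where the page frame comes from. The two flags stand in
 * for the swap cache lookup and the shmem_inode_info lookup. */
static enum page_source resolve_page(int in_swap_cache, int has_swap_entry)
{
    if (in_swap_cache)
        return FROM_SWAP_CACHE;      /* step 3: already in memory */
    if (has_swap_entry)
        return FROM_SWAP_AREA;       /* step 4: swap in, then return it */
    return NEWLY_ALLOCATED;          /* step 5: fresh page, add to page cache */
}
```

The order of the checks matters: a page that is still in the swap cache is returned directly even if a swapped-out page identifier is also recorded for it.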


Chapter 20. Program Execution

The concept of a "process," described in Chapter 3, was used in Unix from the beginning to represent the behavior of groups of running programs that compete for system resources. This final chapter focuses on the relationship between program and process. We specifically describe how the kernel sets up the execution context for a process according to the contents of the program file. While it may not seem like a big problem to load a bunch of instructions into memory and point the CPU to them, the kernel has to deal with flexibility in several areas:

Different executable formats
    Linux is distinguished by its ability to run binaries that were compiled for other operating systems.

Shared libraries
    Many executable files don't contain all the code required to run the program but expect the kernel to load in functions from a library at runtime.

Other information in the execution context
    This includes the command-line arguments and environment variables familiar to programmers.

A program is stored on disk as an executable file, which includes both the object code of the functions to be executed and the data on which these functions will act. Many functions of the program are service routines available to all programmers; their object code is included in special files called "libraries." Actually, the code of a library function may either be statically copied in the executable file (static libraries) or linked to the process at runtime (shared libraries, since their code can be shared by several independent processes).

When launching a program, the user may supply two kinds of information that affect the way it is executed: command-line arguments and environment variables. Command-line arguments are typed in by the user following the executable filename at the shell prompt. Environment variables, such as HOME and PATH, are inherited from the shell, but the users may modify the values of any such variables before they launch the program.

In Section 20.1, we explain what a program execution context is. In Section 20.2, we mention some of the executable formats supported by Linux and show how Linux can change its "personality" to execute programs compiled for other operating systems. Finally, in Section 20.4, we describe the system call that allows a process to start executing a new program.


20.1 Executable Files

Chapter 1 defined a process as an "execution context." By this we mean the collection of information needed to carry on a specific computation; it includes the pages accessed, the open files, the hardware register contents, and so on. An executable file is a regular file that describes how to initialize a new execution context (i.e., how to start a new computation).

Suppose a user wants to list the files in the current directory; he knows that this result can be simply achieved by typing the filename of the /bin/ls [1] external command at the shell prompt. The command shell forks a new process, which in turn invokes an execve( ) system call (see Section 20.4 later in this chapter), passing as one of its parameters a string that includes the full pathname for the ls executable file, /bin/ls in this case. The sys_execve( ) service routine finds the corresponding file, checks the executable format, and modifies the execution context of the current process according to the information stored in it. As a result, when the system call terminates, the process starts executing the code stored in the executable file, which performs the directory listing.

[1] The pathnames of executable files are not fixed in Linux; they depend on the distribution used. Several standard naming schemes, such as FHS, have been proposed for all Unix systems.
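What the command shell does in the example above can be approximated in a few lines of User Mode code. The sketch below forks a child, lets the child replace its program by means of execve( ), and waits for the result. The helper name run_and_wait and the convention of returning -1 on error are illustrative choices, not part of any library.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/* Fork a child, replace its program with the one at path, and return the
 * child's exit status (or -1 on error). */
static int run_and_wait(const char *path, char *const argv[])
{
    pid_t pid = fork();
    if (pid == -1)
        return -1;
    if (pid == 0) {
        execve(path, argv, environ);   /* returns only on failure */
        _exit(127);
    }
    int status;
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

To reproduce the directory listing, a caller could pass "/bin/ls" as the path together with an argument vector such as { "ls", NULL }, assuming the executable is stored at that pathname on the distribution in use.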

When a process starts running a new program, its execution context changes drastically since most of the resources obtained during the process's previous computations are discarded. In the preceding example, when the process starts executing /bin/ls, it replaces the shell's arguments with new ones passed as parameters in the execve( ) system call and acquires a new shell environment (see Section 20.1.2 later in this chapter). All pages inherited from the parent (and shared with the Copy On Write mechanism) are released so that the new computation starts with a fresh User Mode address space; even the privileges of the process could change (see Section 20.1.1 later in this chapter). However, the process PID doesn't change, and the new computation inherits from the previous one all open file descriptors that were not closed automatically while executing the execve( ) system call. [2]

[2] By default, a file already opened by a process stays open after issuing an execve( ) system call. However, the file is automatically closed if the process has set the corresponding bit in the close_on_exec field of the files_struct structure (see Table 12-7 in Chapter 12); this is done by means of the fcntl( ) system call.

20.1.1 Process Credentials and Capabilities

Traditionally, Unix systems associate with each process some credentials, which bind the process to a specific user and a specific user group. Credentials are important on multiuser systems because they determine what each process can or cannot do, thus preserving both the integrity of each user's personal data and the stability of the system as a whole.

The use of credentials requires support both in the process data structure and in the resources being protected. One obvious resource is a file. Thus, in the Ext2 filesystem, each file is owned by a specific user and is bound to a group of users. The owner of a file may decide what kind of operations are allowed on that file, distinguishing among herself, the file's user group, and all other users. When a process tries to access a file, the VFS always checks whether the access is legal, according to the permissions established by the file owner and the process credentials.

The process's credentials are stored in several fields of the process descriptor, listed in Table 20-1. These fields contain identifiers of users and user groups in the system, which are usually compared with the corresponding identifiers stored in the inodes of the files being accessed.

Table 20-1. Traditional process credentials

Name            Description
uid, gid        User and group real identifiers
euid, egid      User and group effective identifiers
fsuid, fsgid    User and group effective identifiers for file access
groups          Supplementary group identifiers
suid, sgid      User and group saved identifiers

A UID of 0 specifies the superuser (root), while a GID of 0 specifies the root group. If a process credential stores a value of 0, the kernel bypasses the permission checks and allows the privileged process to perform various actions, such as those referring to system administration or hardware manipulation, that are not possible to unprivileged processes.

When a process is created, it always inherits the credentials of its parent. However, these credentials can be modified later, either when the process starts executing a new program or when it issues suitable system calls. Usually, the uid, euid, fsuid, and suid fields of a process contain the same value. When the process executes a setuid program (that is, an executable file whose setuid flag is on), the euid and fsuid fields are set to the identifier of the file's owner. Almost all checks involve one of these two fields: fsuid is used for file-related operations, while euid is used for all other operations. Similar considerations apply to the gid, egid, fsgid, and sgid fields that refer to group identifiers.
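The comparison between the file-access credentials and the identifiers stored in an inode can be illustrated with the classic owner/group/other test. The following is a deliberately simplified sketch: the structure and function names are hypothetical, supplementary groups are ignored, and an fsuid of 0 is treated as bypassing every check.

```c
#include <assert.h>

struct cred_model  { unsigned fsuid, fsgid; };
struct inode_model { unsigned uid, gid, mode; };  /* mode: rwxrwxrwx bits */

#define WANT_R 4u
#define WANT_W 2u
#define WANT_X 1u

/* Return 1 if the access is allowed, 0 otherwise. */
static int may_access(const struct cred_model *c,
                      const struct inode_model *i, unsigned want)
{
    unsigned bits;

    if (c->fsuid == 0)
        return 1;                       /* superuser: checks are bypassed */
    if (c->fsuid == i->uid)
        bits = (i->mode >> 6) & 7u;     /* owner class */
    else if (c->fsgid == i->gid)
        bits = (i->mode >> 3) & 7u;     /* group class */
    else
        bits = i->mode & 7u;            /* all other users */
    return (bits & want) == want;
}
```

In this model, a process with fsuid 1000 cannot read a file owned by UID 0 with mode 0600, whereas the same request succeeds once fsuid is 0, which is the effect a setuid program owned by the superuser relies on.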

As an illustration of how the fsuid field is used, consider the typical situation when a user wants to change his password. All passwords are stored in a common file, but he cannot directly edit this file because it is protected. Therefore, he invokes a system program named /usr/bin/passwd, which has the setuid flag set and whose owner is the superuser. When the process forked by the shell executes such a program, its euid and fsuid fields are set to 0, the UID of the superuser. Now the process can access the file, since, when the kernel performs the access control, it finds a 0 value in fsuid. Of course, the /usr/bin/passwd program does not allow the user to do anything but change his own password.

Unix's long history teaches the lesson that setuid programs are quite dangerous: malicious users could trigger some programming errors (bugs) in the code to force setuid programs to perform operations that were never planned by the program's original designers. In the worst case, the entire system's security can be compromised. To minimize such risks, Linux, like all modern Unix systems, allows processes to acquire setuid privileges only when necessary and drop them when they are no longer needed. This feature may turn out to be useful when implementing user applications with several protection levels. The process descriptor includes an suid field, which stores the values of the effective identifiers (euid and fsuid) right after the execution of the setuid program. The process can change the effective identifiers by means of the setuid( ), setresuid( ), setfsuid( ), and setreuid( ) system calls. [3]

[3] GID effective credentials can be changed by issuing the corresponding setgid( ), setresgid( ), setfsgid( ), and setregid( ) system calls.

Table 20-2 shows how these system calls affect the process's credentials. Be warned that if the calling process does not already have superuser privileges (that is, if its euid field is not 0), these system calls can be used only to set values already included in the process's credential fields. For instance, an average user process can store the value 500 into its fsuid field by invoking the setfsuid( ) system call, but only if one of the other credential fields already holds the same value.

Table 20-2. Semantics of the system calls that set process credentials

         setuid(e)    setuid(e)
Field    euid = 0     euid ≠ 0     setresuid(u,e,s)   setreuid(u,e)   setfsuid(f)
uid      Set to e     Unchanged    Set to u           Set to u        Unchanged
euid     Set to e     Set to e     Set to e           Set to e        Unchanged
fsuid    Set to e     Set to e     Set to e           Set to e        Set to f
suid     Set to e     Unchanged    Set to s           Set to e        Unchanged

To understand the sometimes complex relationships among the four user ID fields, consider for a moment the effects of the setuid( ) system call. The actions are different, depending on whether the calling process's euid field is set to 0 (that is, the process has superuser privileges) or to a normal UID.

If the euid field is 0, the system call sets all credential fields of the calling process (uid, euid, fsuid, and suid) to the value of the parameter e. A superuser process can thus drop its privileges and become a process owned by a normal user. This happens, for instance, when a user logs in: the system forks a new process with superuser privileges, but the process drops its privileges by invoking the setuid( ) system call and then starts executing the user's login shell program.

If the euid field is not 0, the system call modifies only the value stored in euid and fsuid, leaving the other two fields unchanged. This allows a process executing a setuid program to have its effective privileges stored in euid and fsuid set alternately to uid (the process acts as the user who launched the executable file) and to suid (the process acts as the user who owns the executable file).
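The two cases of setuid( ) can be restated as a short User Mode model. It is a sketch of the rules given in the text, not the kernel's implementation: struct model_ids and model_setuid( ) are hypothetical names, and the model assumes that an unprivileged caller may select only a value already held in its uid or suid field.

```c
/* Hypothetical copy of the four user ID fields of a process descriptor. */
struct model_ids {
    unsigned int uid, euid, fsuid, suid;
};

/*
 * Model of setuid(e).
 * Returns 0 on success, -1 if an unprivileged caller asks for a UID
 * that is stored in neither its uid nor its suid field.
 */
int model_setuid(struct model_ids *p, unsigned int e)
{
    if (p->euid == 0) {
        /* Superuser: all four fields take the new value. */
        p->uid = p->euid = p->fsuid = p->suid = e;
        return 0;
    }
    if (e != p->uid && e != p->suid)
        return -1;
    /* Normal user: only the effective identifiers change. */
    p->euid = p->fsuid = e;
    return 0;
}
```

Starting from uid 500 and euid, fsuid, and suid all equal to 600 (user 500 running a setuid program owned by user 600), model_setuid( ) can switch the effective identifiers to 500 and back to 600, but a request for any other value fails. Once a superuser process calls it with a nonzero value, no field holds 0 any longer, so the privileges cannot be regained.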

20.1.1.1 Process capabilities

Linux is moving toward another model of process credentials based on the notion of "capabilities." A capability is simply a flag that asserts whether the process is allowed to perform a specific operation or a specific class of operations. This model is different from the traditional "superuser versus normal user" model in which a process can either do everything or do nothing, depending on its effective UID. As illustrated in Table 20-3, several capabilities have already been included in the Linux kernel.

Table 20-3. Linux capabilities

Name                    Description
CAP_CHOWN               Ignore restrictions on file user and group ownership changes.
CAP_DAC_OVERRIDE        Ignore file access permissions.
CAP_DAC_READ_SEARCH     Ignore file/directory read and search permissions.
CAP_FOWNER              Generally ignore permission checks on file ownership.
CAP_FSETID              Ignore restrictions on setting the setuid and setgid flags for files.
CAP_KILL                Bypass permission checks when generating signals.
CAP_IPC_LOCK            Allow locking of pages and of shared memory segments.
CAP_IPC_OWNER           Skip IPC ownership checks.
CAP_LEASE               Allow taking of leases on files (see Section 12.7.1).
CAP_LINUX_IMMUTABLE     Allow modification of append-only and immutable Ext2/Ext3 files.
CAP_MKNOD               Allow privileged mknod( ) operations.
CAP_NET_ADMIN           Allow general networking administration.
CAP_NET_BIND_SERVICE    Allow binding to TCP/UDP sockets below 1,024.
CAP_NET_BROADCAST       Currently unused.
CAP_NET_RAW             Allow use of RAW and PACKET sockets.
CAP_SETGID              Ignore restrictions on group's process credentials manipulations.
CAP_SETPCAP             Allow capability manipulations.
CAP_SETUID              Ignore restrictions on user's process credentials manipulations.
CAP_SYS_ADMIN           Allow general system administration.
CAP_SYS_BOOT            Allow use of reboot( ).
CAP_SYS_CHROOT          Allow use of chroot( ).
CAP_SYS_MODULE          Allow inserting and removing of kernel modules.
CAP_SYS_NICE            Skip permission checks of the nice( ) and setpriority( ) system calls, and allow creation of real-time processes.
CAP_SYS_PACCT           Allow configuration of process accounting.
CAP_SYS_PTRACE          Allow use of ptrace( ) on any process.
CAP_SYS_RAWIO           Allow access to I/O ports through ioperm( ) and iopl( ).
CAP_SYS_RESOURCE        Allow resource limits to be increased.
CAP_SYS_TIME            Allow manipulation of system clock and real-time clock.
CAP_SYS_TTY_CONFIG      Allow execution of the vhangup( ) system call to configure the terminal.

The main advantage of capabilities is that, at any time, each program needs a limited number of them. Consequently, even if a malicious user discovers a way to exploit a buggy program, she can illegally perform only a limited set of operations.

Assume, for instance, that a buggy program has only the CAP_SYS_TIME capability. In this case, the malicious user who discovers an exploitation of the bug can succeed only in illegally changing the real-time clock and the system clock. She won't be able to perform any other kind of privileged operations.

Neither the VFS nor the Ext2 filesystem currently supports the capability model, so there is no way to associate an executable file with the set of capabilities that should be enforced when a process executes that file. Nevertheless, a process can explicitly get and set its capabilities by using, respectively, the capget( ) and capset( ) system calls, provided that the process already owns the CAP_SETPCAP capability. For instance, it is possible to modify the login program to retain a subset of the capabilities and drop the others.

The Linux kernel already takes capabilities into account. Let's consider, for instance, the nice( ) system call, which allows users to change the static priority of a process. In the traditional model, only the superuser can raise a priority; the kernel should therefore check whether the euid field in the descriptor of the calling process is set to 0. However, the Linux kernel defines a capability called CAP_SYS_NICE, which corresponds exactly to this kind of operation. The kernel checks the value of this flag by invoking the capable( ) function and passing the CAP_SYS_NICE value to it.
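The idea of testing one capability flag rather than the effective UID can be sketched in User Mode as a bit test on a capability mask. The names below (model_capable, model_nice, and the MODEL_CAP_ constants) are hypothetical; the two bit positions are assumed to match the values of CAP_SYS_NICE and CAP_SYS_TIME in the kernel's capability header, and the real capable( ) function works on the descriptor of the current process rather than on an explicit argument.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions assumed to match CAP_SYS_NICE and CAP_SYS_TIME. */
enum { MODEL_CAP_SYS_NICE = 23, MODEL_CAP_SYS_TIME = 25 };

/* True if the flag for capability cap is raised in the effective set. */
bool model_capable(uint32_t effective, int cap)
{
    return (effective >> cap) & 1u;
}

/*
 * Model of the check made by nice(): a negative increment raises the
 * priority and therefore requires the CAP_SYS_NICE flag.
 * Returns 0 on success, -1 if the operation is not permitted.
 */
int model_nice(uint32_t effective, int *priority, int increment)
{
    if (increment < 0 && !model_capable(effective, MODEL_CAP_SYS_NICE))
        return -1;
    *priority += increment;
    return 0;
}
```

With this model, a process whose mask contains only the CAP_SYS_TIME bit can lower its own priority but cannot raise it, which mirrors the buggy-program scenario described above.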

This approach works thanks to some "compatibility hacks" that have been added to the kernel code: each time a process sets the euid and fsuid fields to 0 (either by invoking one of the system calls listed in Table 20-2 or by executing a setuid program owned by the superuser), the kernel sets all process capabilities so that all checks will succeed. When the process resets the euid and fsuid fields to the real UID of the process owner, the kernel checks the keep_capabilities flag in the process descriptor and drops all capabilities of the process if the flag is not set. A process can set and reset the keep_capabilities flag by means of the Linux-specific prctl( ) system call.
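From User Mode, the flag can be read and changed through prctl( ). The following sketch assumes a Linux system whose C library exposes the PR_SET_KEEPCAPS and PR_GET_KEEPCAPS options in <sys/prctl.h>; the two wrapper names are hypothetical and exist only for illustration.

```c
#include <sys/prctl.h>

/* Sets (keep != 0) or clears (keep == 0) the flag for the calling process.
 * Returns 0 on success, -1 on error with errno set by prctl(). */
int set_keep_capabilities(int keep)
{
    return prctl(PR_SET_KEEPCAPS, keep ? 1L : 0L, 0L, 0L, 0L);
}

/* Returns the current value of the flag (0 or 1), or -1 on error. */
int get_keep_capabilities(void)
{
    return prctl(PR_GET_KEEPCAPS, 0L, 0L, 0L, 0L);
}
```

A privileged program that wants to keep a few capabilities after switching to a normal UID would typically set the flag first, change its user IDs, and then reduce its capability sets explicitly.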

20.1.2 Command-Line Arguments and Shell Environment

When a user types a command, the program that is loaded to satisfy the request may receive some command-line arguments from the shell. For example, when a user types the command:

$ ls -l /usr/bin

to get a full listing of the files in the /usr/bin directory, the shell process creates a new process to execute the command. This new process loads the /bin/ls executable file. In doing so, most of the execution context inherited from the shell is lost, but the three separate arguments ls, -l, and /usr/bin are kept. Generally, the new process may receive any number of arguments.

The conventions for passing the command-line arguments depend on the high-level language used. In the C language, the main( ) function of a program may receive as parameters an integer specifying how many arguments have been passed to the program and the address of an array of pointers to strings. The following prototype formalizes this standard:

int main(int argc, char *argv[])

Going back to the previous example, when the /bin/ls program is invoked, argc has the value 3, argv[0] points to the ls string, argv[1] points to the -l string, and argv[2] points to the /usr/bin string. The end of the argv array is always marked by a null pointer, so argv[3] contains NULL.
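Because the argv array ends with a null pointer, the argument count can be recovered from the array alone. The helper below is a minimal sketch; count_args( ) is a hypothetical name, not a library function.

```c
#include <stddef.h>

/* Counts the entries of a NULL-terminated argument vector. */
int count_args(char *const argv[])
{
    int n = 0;

    while (argv[n] != NULL)   /* the array ends with a null pointer */
        n++;
    return n;
}
```

Applied to the vector built for the ls -l /usr/bin example, the function yields the same value that main( ) receives in argc.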

A third optional parameter that may be passed in the C language to the main( ) function is the parameter containing environment variables. They are used to customize the execution context of a process, to provide general information to a user or other processes, or to allow a process to keep some information across an execve( ) system call.

To use the environment variables, main( ) can be declared as follows:

int main(int argc, char *argv[], char *envp[])

The envp parameter points to an array of pointers to environment strings of the form:

VAR_NAME=something

where VAR_NAME represents the name of an environment variable, while the substring following the = delimiter represents the actual value assigned to the variable. The end of the envp array is marked by a null pointer, like the argv array. The address of the envp array is also stored in the environ global variable of the C library.

Command-line arguments and environment strings are placed on the User Mode stack, right before the return address (see Section 9.2.3). The bottom locations of the User Mode stack are illustrated in Figure 20-1. Notice that the environment variables are located near the bottom of the stack, right after a 0 long integer.

Figure 20-1. The bottom locations of the User Mode stack
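The VAR_NAME=something layout of the envp array makes a lookup straightforward. The following sketch scans a null-terminated array of environment strings for a given name; lookup_env( ) is a hypothetical helper that behaves much like the C library's getenv( ) but takes the array as an explicit argument.

```c
#include <stddef.h>
#include <string.h>

/*
 * Returns a pointer to the value of the variable called name,
 * or NULL if no string of the form name=value is present.
 */
const char *lookup_env(char *const envp[], const char *name)
{
    size_t len = strlen(name);

    for (size_t i = 0; envp[i] != NULL; i++) {
        /* The name must match in full and be followed by the = delimiter. */
        if (strncmp(envp[i], name, len) == 0 && envp[i][len] == '=')
            return envp[i] + len + 1;
    }
    return NULL;
}
```

Inside a program declared with the three-parameter form of main( ), a call such as lookup_env(envp, "HOME") returns the string following HOME= when the variable is defined.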

20.1.3 Libraries

Each high-level source code file is transformed through several steps into an object file, which contains the machine code of the assembly language instructions corresponding to the high-level instructions. An object file cannot be executed, since it does not contain the linear address that corresponds to each reference to a name of a global symbol external to the source code file, such as functions in libraries or other source code files of the same program. The assigning, or resolution, of such addresses is performed by the linker, which collects all the object files of the program and constructs the executable file. The linker also analyzes the library's functions used by the program and glues them into the executable file in a manner described later in this chapter.

Most programs, even the most trivial ones, use libraries. Consider, for instance, the following one-line C program:

void main(void) { }

Although this program does not compute anything, a lot of work is needed to set up the execution environment (see Section 20.4 later in this chapter) and to kill the process when the program terminates (see Section 3.5). In particular, when the main( ) function terminates, the C compiler inserts an exit( ) function call in the object code.

We know from Chapter 9 that programs usually invoke system calls through wrapper routines in the C library. This holds for the C compiler too. Besides including the code directly generated by compiling the program's statements, each executable file also includes some "glue" code to handle the interactions of the User Mode process with the kernel. Portions of such glue code are stored in the C library.

Many other libraries of functions, besides the C library, are included in Unix systems. A generic Linux system could easily have 50 different libraries. Just to mention a couple of them: the math library libm includes advanced functions for floating point operations, while the X11 library libX11 collects together the basic low-level functions for the X11 Window System graphics interface.

All executable files in traditional Unix systems were based on static libraries. This means that the executable file produced by the linker includes not only the code of the original program but also the code of the library functions that the program refers to. One big disadvantage of static libraries is that they eat lots of space on disk. Indeed, each statically linked executable file duplicates some portion of library code.

Modern Unix systems use shared libraries. The executable file does not contain the library object code, but only a reference to the library name. When the program is loaded in memory for execution, a suitable program called the program interpreter (or ld.so) takes care of analyzing the library names in the executable file, locating the library in the system's directory tree and making the requested code available to the executing process. A process can also load additional shared libraries at run time by using the dlopen( ) library function.

Shared libraries are especially convenient on systems that provide file memory mapping, since they reduce the amount of main memory requested for executing a program. When the program interpreter must link some shared library to a process, it does not copy the object code, but just performs a memory mapping of the relevant portion of the library file into the process's address space. This allows the page frames containing the machine code of the library to be shared among all processes that are using the same code.

Shared libraries also have some disadvantages. The startup time of a dynamically linked program is usually longer than that of a statically linked one. Moreover, dynamically linked programs are not as portable as statically linked ones, since they may not execute properly in systems that include a different version of the same library.

A user may always require a program to be linked statically. For example, the GCC compiler offers the -static option, which tells the linker to use the static libraries instead of the shared ones.

20.1.4 Program Segments and Process Memory Regions

The linear address space of a Unix program is traditionally partitioned, from a logical point of view, in several linear address intervals called segments: [4]

Text segment
    Includes the executable code.

Initialized data segment
    Contains the initialized data, that is, the static variables and the global variables whose initial values are stored in the executable file (because the program must know their values at startup).

Uninitialized data segment (bss)
    Contains the uninitialized data, that is, all global variables whose initial values are not stored in the executable file (because the program sets the values before referencing them); it is historically called a bss segment.

Stack segment
    Contains the program stack, which includes the return addresses, parameters, and local variables of the functions being executed.

[4] The word "segment" has historical roots, since the first Unix systems implemented each linear address interval with a different segment register. Linux, however, does not rely on the segmentation mechanism of the 80x86 microprocessors to implement program segments.

Each mm_struct memory descriptor (see Section 8.2) includes some fields that identify the role of particular memory regions of the corresponding process:

start_code, end_code
    Store the initial and final linear addresses of the memory region that includes the native code of the program (the code in the executable file). Since the text segment includes shared libraries but the executable file does not, the memory region demarcated by these fields is a subset of the text segment.

start_data, end_data
    Store the initial and final linear addresses of the memory region that includes the native initialized data of the program, as specified in the executable file. The fields identify a memory region that roughly corresponds to the data segment. Actually, start_data should almost always be set to the address of the first page right after end_code, and thus the field is unused. The end_data field is used, though.

start_brk, brk
    Store the initial and final linear addresses of the memory region that includes the dynamically allocated memory areas of the process (see Section 8.6). This memory region is sometimes called the heap.

start_stack
    Stores the address right above that of main( )'s return address; as illustrated in Figure 20-1, higher addresses are reserved (recall that stacks grow toward lower addresses).

arg_start, arg_end
    Store the initial and final addresses of the stack portion containing the command-line arguments.

env_start, env_end
    Store the initial and final addresses of the stack portion containing the environment strings.

Notice that shared libraries and file memory mapping have made the classification of the process's address space based on program segments a bit obsolete, since each of the shared libraries is mapped into a different memory region from those discussed in the preceding list.

Now we'll describe, by means of a simple example, how the Linux kernel maps shared libraries into the process's address space. We assume as usual that the User Mode address space ranges from 0x00000000 to 0xbfffffff. We consider the /sbin/init program, which creates and monitors the activity of all the processes that implement the outer layers of the operating system (see Section 3.4.2). The memory regions of the corresponding init process are shown in Table 20-4 (this information can be obtained from the /proc/1/maps file; you might see a different table, of course, depending on the version of the init program and how it has been compiled and linked).

Notice that all regions listed are implemented by means of private memory mappings (the letter p in the Permissions column). This is not surprising because these memory regions exist only to provide data to a process. While executing instructions, a process may modify the contents of these memory regions; however, the files on disk associated with them stay unchanged. This is precisely how private memory mappings act.

Table 20-4. Memory regions of the init process

Address range            Perms   Mapped file
0x08048000-0x0804cfff    r-xp    /sbin/init at offset 0
0x0804d000-0x0804dfff    rw-p    /sbin/init at offset 0x4000
0x0804e000-0x0804efff    rwxp    Anonymous
0x40000000-0x40014fff    r-xp    /lib/ld-2.2.3.so at offset 0
0x40015000-0x40015fff    rw-p    /lib/ld-2.2.3.so at offset 0x14000
0x40016000-0x40016fff    rw-p    Anonymous
0x40020000-0x40126fff    r-xp    /lib/libc.2.2.3.so at offset 0
0x40127000-0x4012cfff    rw-p    /lib/libc.2.2.3.so at offset 0x106000
0x4012d000-0x40130fff    rw-p    Anonymous
0xbfffd000-0xbfffffff    rwxp    Anonymous

The memory region starting from 0x8048000 is a memory mapping associated with the portion of the /sbin/init file ranging from byte 0 to byte 20,479 (only the start and end of the region are shown in the /proc/1/maps file, but the region size can easily be derived from them). The permissions specify that the region is executable (it contains object code), read only (it's not writable because the instructions don't change during a run), and private, so we can guess that the region maps the text segment of the program.

The memory region starting from 0x804d000 is a memory mapping associated with another portion of /sbin/init ranging from byte 16,384 (corresponding to offset 0x4000 shown in Table 20-4) to 20,479. Since the permissions specify that the private region may be written, we can conclude that it maps the data segment of the program.

The next one-page memory region starting from 0x0804e000 is anonymous, that is, it is not associated with any file and includes the bss segment of init.

Similarly, the next three memory regions starting from 0x40000000, 0x40015000, and 0x40016000 correspond to the text segment, the data segment, and the bss segment, respectively, of the /lib/ld-2.2.3.so library, which is the program interpreter for the ELF shared libraries. The program interpreter is never executed alone: it is always memory-mapped inside the address space of a process executing another program.

On this system, the C library happens to be stored in the /lib/libc.2.2.3.so file. The text segment, data segment, and bss segment of the C library are mapped into the next three memory regions, starting from address 0x40020000. Remember that page frames included in private regions can be shared among several processes with the Copy On Write mechanism, as long as they are not modified. Thus, since the text segment is read-only, the page frames containing the executable code of the C library are shared among almost all currently executing processes (all except the statically linked ones).

Finally, the last anonymous memory region from 0xbfffd000 to 0xbfffffff is associated with the User Mode stack. We already explained in Section 8.4 how the stack is automatically expanded toward lower addresses whenever necessary.
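Deriving a region's size from a line of a /proc maps file can be sketched as follows. The struct maps_entry type and the two functions are hypothetical helpers; the sketch assumes the usual line layout (start-end, permissions, file offset, then device, inode, and pathname, which are ignored here), and it relies on the fact that the kernel prints the end address as the first byte past the region, whereas Table 20-4 shows the last byte inside it.

```c
#include <stdio.h>

/* The fields of a maps line that this sketch cares about. */
struct maps_entry {
    unsigned long start;    /* first address of the region */
    unsigned long end;      /* first address past the region (exclusive) */
    char perms[5];          /* for example "r-xp" */
    unsigned long offset;   /* offset inside the mapped file */
};

/* Parses the leading fields of one line; returns 0 on success, -1 otherwise. */
int parse_maps_line(const char *line, struct maps_entry *e)
{
    if (sscanf(line, "%lx-%lx %4s %lx",
               &e->start, &e->end, e->perms, &e->offset) != 4)
        return -1;
    return 0;
}

/* Size of the region in bytes. */
unsigned long region_size(const struct maps_entry *e)
{
    return e->end - e->start;
}
```

For a line describing the first region of Table 20-4, which would begin with 08048000-0804d000 r-xp 00000000, the computed size is 0x5000 bytes, that is, the 20,480 bytes from byte 0 to byte 20,479 of the file.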

20.1.5 Execution Tracing

Execution tracing is a technique that allows a program to monitor the execution of another program. The traced program can be executed step by step, until a signal is received, or until a system call is invoked. Execution tracing is widely used by debuggers, together with other techniques like the insertion of breakpoints in the debugged program and run-time access to its variables. We focus on how the kernel supports execution tracing rather than discussing how debuggers work.

In Linux, execution tracing is performed through the ptrace( ) system call, which can handle the commands listed in Table 20-5. Processes having the CAP_SYS_PTRACE capability flag set are allowed to trace any process in the system except init. Conversely, a process P with no CAP_SYS_PTRACE capability is allowed to trace only processes having the same owner as P. Moreover, a process cannot be traced by two processes at the same time.

Table 20-5. The ptrace commands

Command              Description
PTRACE_TRACEME       Start execution tracing for the current process
PTRACE_PEEKTEXT      Read a 32-bit value from the text segment
PTRACE_PEEKDATA      Read a 32-bit value from the data segment
PTRACE_PEEKUSR       Read the CPU's normal and debug registers
PTRACE_POKETEXT      Write a 32-bit value into the text segment
PTRACE_POKEDATA      Write a 32-bit value into the data segment
PTRACE_POKEUSR       Write the CPU's normal and debug registers
PTRACE_CONT          Resume execution
PTRACE_KILL          Kill the traced process
PTRACE_SINGLESTEP    Resume execution for a single assembly language instruction
PTRACE_GETREGS       Read privileged CPU registers
PTRACE_SETREGS       Write privileged CPU registers
PTRACE_GETFPREGS     Read floating point registers
PTRACE_SETFPREGS     Write floating point registers
PTRACE_GETFPXREGS    Read MMX and XMM registers
PTRACE_SETFPXREGS    Write MMX and XMM registers
PTRACE_ATTACH        Start execution tracing for another process
PTRACE_DETACH        Terminate execution tracing
PTRACE_SETOPTIONS    Modify ptrace( ) behavior
PTRACE_SYSCALL       Resume execution until the next system call boundary

The ptrace( ) system call modifies the p_pptr field in the descriptor of the traced process so that it points to the tracing process; therefore, the tracing process becomes the effective parent of the traced one. When execution tracing terminates (i.e., when ptrace( ) is invoked with the PTRACE_DETACH command), the system call sets p_pptr to the value of p_opptr, thus restoring the original parent of the traced process (see Section 3.2.3).

Several monitored events can be associated with a traced program:

● End of execution of a single assembly language instruction
● Entering a system call
● Exiting from a system call
● Receiving a signal

When a monitored event occurs, the traced program is stopped and a SIGCHLD signal is sent to its parent. When the parent wishes to resume the child's execution, it can use one of the PTRACE_CONT, PTRACE_SINGLESTEP, and PTRACE_SYSCALL commands, depending on the kind of event it wants to monitor.

The PTRACE_CONT command just resumes execution; the child executes until it receives another signal. This kind of tracing is implemented by means of the PT_PTRACED flag in the ptrace field of the process descriptor, which is checked by the do_signal( ) function (see Section 10.3).

The PTRACE_SINGLESTEP command forces the child process to execute the next assembly language instruction, and then stops it again. This kind of tracing is implemented on 80 x 86-based machines by means of the TF trap flag in the eflags register: when it is on, a "Debug" exception is raised right after any assembly language instruction. The corresponding exception handler just clears the flag, forces the current process to stop, and sends a SIGCHLD signal to its parent. Notice that setting the TF flag is not a privileged operation, so User Mode processes can force single-step execution even without the ptrace( ) system call. The kernel checks the PT_DTRACE flag in the process descriptor to keep track of whether the child process is being single-stepped through ptrace( ).

The PTRACE_SYSCALL command causes the traced process to resume execution until a system call is invoked. The process is stopped twice: the first time when the system call starts, and the second time when the system call terminates. This kind of tracing is implemented by means of the PT_TRACESYS flag in the process descriptor, which is checked in the system_call( ) assembly language function (see Section 9.2.2).

A process can also be traced using some debugging features of the Intel Pentium processors. For example, the parent could set the values of the dr0, ... dr7 debug registers for the child by using the PTRACE_POKEUSR command. When an event monitored by a debug register occurs, the CPU raises the "Debug" exception; the exception handler can then suspend the traced process and send the SIGCHLD signal to the parent.
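To make the PTRACE_TRACEME and PTRACE_SYSCALL commands more concrete, here is a minimal User Mode sketch of a tracer. The function name count_syscall_stops( ) is our own invention; the sketch assumes a Linux system with the standard C library wrappers for ptrace( ) and waitpid( ), and it deliberately ignores details a real debugger must handle, such as forwarding signals to the child.

```c
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that asks to be traced, then resume it repeatedly with
 * PTRACE_SYSCALL until it exits. Returns the number of system call stops
 * (entries plus exits) observed by the parent, or -1 on error.
 * Hypothetical helper written for illustration only. */
int count_syscall_stops(void)
{
    int status, stops = 0;
    pid_t child = fork();

    if (child < 0)
        return -1;
    if (child == 0) {
        /* Child: become traced by the parent, then stop so that the
         * parent can take control before anything interesting happens. */
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) < 0)
            _exit(1);
        kill(getpid(), SIGSTOP);
        syscall(SYS_getppid);          /* a system call to be observed */
        _exit(0);
    }

    /* Parent: wait for the initial stop caused by SIGSTOP. */
    if (waitpid(child, &status, 0) < 0 || !WIFSTOPPED(status))
        return -1;

    for (;;) {
        /* Resume the child until the next system call boundary.
         * Passing 0 as the last argument discards any pending signal. */
        if (ptrace(PTRACE_SYSCALL, child, NULL, NULL) < 0)
            return -1;
        if (waitpid(child, &status, 0) < 0)
            return -1;
        if (WIFEXITED(status) || WIFSIGNALED(status))
            break;
        /* Without extra options, a system call stop is reported as SIGTRAP. */
        if (WIFSTOPPED(status) && WSTOPSIG(status) == SIGTRAP)
            stops++;
    }
    return stops;
}
```

Each system call executed by the child while PTRACE_SYSCALL is in effect contributes two stops, one on entry and one on exit, except for the final exit call, which never returns.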


20.2 Executable Formats

The standard Linux executable format is named Executable and Linking Format (ELF). It was developed by Unix System Laboratories and is now the most widely used format in the Unix world. Several well-known Unix operating systems, such as System V Release 4 and Sun's Solaris 2, have adopted ELF as their main executable format.

Older Linux versions supported another format named Assembler OUTput Format (a.out); actually, there were several versions of that format floating around the Unix world. It is seldom used now, since ELF is much more practical.

Linux supports many other different formats for executable files; in this way, it can run programs compiled for other operating systems, such as MS-DOS EXE programs or BSD Unix's COFF executables. A few executable formats, like Java or bash scripts, are platform-independent.

An executable format is described by an object of type linux_binfmt, which essentially provides three methods:

load_binary
    Sets up a new execution environment for the current process by reading the information stored in an executable file.

load_shlib
    Dynamically binds a shared library to an already running process; it is activated by the uselib( ) system call.

core_dump
    Stores the execution context of the current process in a file named core. This file, whose format depends on the type of executable of the program being executed, is usually created when a process receives a signal whose default action is "dump" (see Section 10.1.1).

All linux_binfmt objects are included in a singly linked list, and the address of the first element is stored in the formats variable. Elements can be inserted and removed in the list by invoking the register_binfmt( ) and unregister_binfmt( ) functions. The register_binfmt( ) function is executed during system startup for each executable format compiled into the kernel. This function is also executed when a module implementing a new executable format is being loaded, while the unregister_binfmt( ) function is invoked when the module is unloaded.

The last element in the formats list is always an object describing the executable format for interpreted scripts. This format defines only the load_binary method. The corresponding load_script( ) function checks whether the executable file starts with the #! pair of characters. If so, it interprets the rest of the first line as the pathname of another executable file and tries to execute it by passing the name of the script file as a parameter.[5]

[5] It is possible to execute a script file even if it doesn't start with the #! characters, as long as the file is written in the language recognized by a command shell. In this case, however, the script is interpreted either by the shell on which the user types the command or by the default Bourne shell sh; therefore, the kernel is not directly involved.
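The list of executable formats can be modeled in a few lines of User Mode C. The sketch below is not the kernel's code: the struct binfmt type and the register_format( ), unregister_format( ), and search_handler( ) functions are simplified stand-ins of our own, and each load_binary method is reduced to a check of the magic number in a 128-byte buffer. We assume that new formats are inserted at the head of the list, so that a script format registered first ends up last.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define BINPRM_BUF_SIZE 128   /* size of the buffer holding the file's first bytes */

/* Simplified stand-in for linux_binfmt: one method only. */
struct binfmt {
    struct binfmt *next;
    const char *name;
    int (*load_binary)(const unsigned char *buf);  /* 0 or -ENOEXEC */
};

static struct binfmt *formats;   /* head of the singly linked list */

/* Insert a format at the head of the list (an assumption of this sketch). */
void register_format(struct binfmt *fmt)
{
    fmt->next = formats;
    formats = fmt;
}

/* Unlink a format from the list, if present. */
void unregister_format(struct binfmt *fmt)
{
    struct binfmt **p;

    for (p = &formats; *p != NULL; p = &(*p)->next) {
        if (*p == fmt) {
            *p = fmt->next;
            fmt->next = NULL;
            return;
        }
    }
}

static int load_elf(const unsigned char *buf)
{
    return memcmp(buf, "\177ELF", 4) == 0 ? 0 : -ENOEXEC;
}

static int load_script(const unsigned char *buf)
{
    return (buf[0] == '#' && buf[1] == '!') ? 0 : -ENOEXEC;
}

struct binfmt script_format = { NULL, "script", load_script };
struct binfmt elf_format    = { NULL, "elf",    load_elf };

/* Try each registered format in turn; stop at the first one that
 * recognizes the buffer. Returns the format name, or NULL if none does. */
const char *search_handler(const unsigned char *buf)
{
    struct binfmt *fmt;

    for (fmt = formats; fmt != NULL; fmt = fmt->next)
        if (fmt->load_binary(buf) == 0)
            return fmt->name;
    return NULL;
}
```

A caller would register script_format first and elf_format second, fill a zero-padded buffer of BINPRM_BUF_SIZE bytes with the beginning of a file, and pass it to search_handler( ).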

Linux allows users to register their own custom executable formats. Each such format may be recognized either by means of a magic number stored in the first 128 bytes of the file, or by a filename extension that identifies the file type. For example, MS-DOS extensions consist of three characters separated from the filename by a dot: the .exe extension identifies executable programs, while the .bat extension identifies shell scripts.

Each custom format is associated with an interpreter program, which is automatically invoked by the kernel with the original custom executable filename as a parameter. The mechanism is similar to the script's format, but it's more powerful since it doesn't impose any restrictions on the custom format. To register a new format, the user writes into the /proc/sys/fs/binfmt_misc/register file a string with the following format:

:name:type:offset:string:mask:interpreter:

where each field has the following meaning:

name
    An identifier for the new format

type
    The type of recognition (M for magic number, E for extension)

offset
    The starting offset of the magic number inside the file

string
    The byte sequence to be matched either in the magic number or in the extension

mask
    The string to mask out some bits in string

interpreter
    The full pathname of the program interpreter

For example, the following command performed by the superuser enables the kernel to recognize the Microsoft Windows executable format:

$ echo ':DOSWin:M:0:MZ:0xff:/usr/local/bin/wine:' > /proc/sys/fs/binfmt_misc/register

A Windows executable file has the MZ magic number in the first two bytes, and it is executed by the /usr/local/bin/wine program interpreter.
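A program that wants to register a format has to assemble the same colon-separated string before writing it to the register file. The following sketch only builds the string; the function name format_binfmt_misc_entry( ) is hypothetical, and all fields are passed as plain strings exactly as they would appear in the echo command above.

```c
#include <stdio.h>

/* Build a ":name:type:offset:string:mask:interpreter:" registration
 * string. Returns the string length, or -1 if it does not fit in out.
 * Hypothetical helper: it performs no validation of the fields. */
int format_binfmt_misc_entry(char *out, size_t outlen,
                             const char *name, char type,
                             const char *offset, const char *magic,
                             const char *mask, const char *interpreter)
{
    int n = snprintf(out, outlen, ":%s:%c:%s:%s:%s:%s:",
                     name, type, offset, magic, mask, interpreter);

    if (n < 0 || (size_t)n >= outlen)
        return -1;          /* encoding error or truncated output */
    return n;
}
```

Called with the values "DOSWin", 'M', "0", "MZ", "0xff", and "/usr/local/bin/wine", the function yields the string used in the example above; writing it to the register file still requires superuser privileges.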


20.3 Execution Domains

As mentioned in Chapter 1, a neat feature of Linux is its ability to execute files compiled for other operating systems. Of course, this is possible only if the files include machine code for the same computer architecture on which the kernel is running. Two kinds of support are offered for these "foreign" programs:

● Emulated execution: necessary to execute programs that include system calls that are not POSIX-compliant
● Native execution: valid for programs whose system calls are totally POSIX-compliant

Microsoft MS-DOS and Windows programs are emulated: they cannot be natively executed, since they include APIs that are not recognized by Linux. An emulator like DOSemu or Wine (which appeared in the example at the end of the previous section) is invoked to translate each API call into an emulating wrapper function call, which in turn uses the existing Linux system calls. Since emulators are mostly implemented as User Mode applications, we don't discuss them further.

On the other hand, POSIX-compliant programs compiled on operating systems other than Linux can be executed without too much trouble, since POSIX operating systems offer similar APIs. (Actually, the APIs should be identical, although this is not always the case.) Minor differences that the kernel must iron out usually refer to how system calls are invoked or how the various signals are numbered. This information is stored in execution domain descriptors of type exec_domain.

A process specifies its execution domain by setting the personality field of its descriptor and storing the address of the corresponding exec_domain data structure in the exec_domain field. A process can change its personality by issuing a suitable system call named personality( ); typical values assumed by the system call's parameter are listed in Table 20-6. The C library does not include a corresponding wrapper routine because programmers are not expected to directly change the personality of their programs. Instead, the personality( ) system call should be issued by the glue code that sets up the execution context of the process (see the next section).

Table 20-6. Main personalities supported by the Linux kernel

Personality      Operating system
PER_LINUX        Standard execution domain
PER_SVR4         System V Release 4
PER_SVR3         System V Release 3
PER_SCOSVR3      SCO Unix Version 3.2
PER_OSR5         SCO OpenServer Release 5
PER_WYSEV386     Unix System V/386 Release 3.2.1
PER_ISCR4        Interactive Unix
PER_BSD          BSD Unix
PER_SUNOS        SunOS
PER_XENIX        Xenix
PER_IRIX32       SGI Irix-5 32 bit
PER_IRIXN32      SGI Irix-6 32 bit
PER_IRIX64       SGI Irix-6 64 bit
PER_RISCOS       RISC OS
PER_SOLARIS      Sun's Solaris
PER_UW7          Caldera's UnixWare 7
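A process can also ask the kernel for its current personality without changing it. The sketch below issues the system call directly through the generic syscall( ) routine, consistent with the remark that no dedicated wrapper is assumed; the function name query_personality( ) is our own, and we rely on the convention that the special argument 0xffffffff means "report the current value only".

```c
#include <sys/syscall.h>
#include <unistd.h>

/* Return the current personality of the calling process, or -1 on error.
 * Hypothetical helper: the value 0xffffffff asks the kernel to report the
 * personality without modifying it. */
long query_personality(void)
{
    return syscall(SYS_personality, 0xffffffffUL);
}
```

For an ordinary Linux program, the low-order byte of the returned value is expected to identify the standard execution domain, PER_LINUX.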


20.4 The exec Functions

Unix systems provide a family of functions that replace the execution context of a process with a new context described by an executable file. The names of these functions start with the prefix exec, followed by one or two letters; therefore, a generic function in the family is usually referred to as an exec function.

The exec functions are listed in Table 20-7; they differ in how the parameters are interpreted.

Table 20-7. The exec functions

Function name    PATH search    Command-line arguments    Environment array
execl( )         No             List                      No
execlp( )        Yes            List                      No
execle( )        No             List                      Yes
execv( )         No             Array                     No
execvp( )        Yes            Array                     No
execve( )        No             Array                     Yes

The first parameter of each function denotes the pathname of the file to be executed. The pathname can be absolute or relative to the process's current directory. Moreover, if the name does not include any / characters, the execlp( ) and execvp( ) functions search for the executable file in all directories specified by the PATH environment variable.

Besides the first parameter, the execl( ), execlp( ), and execle( ) functions include a variable number of additional parameters. Each points to a string describing a command-line argument for the new program; as the "l" character in the function names suggests, the parameters are organized in a list terminated by a NULL value. Usually, the first command-line argument duplicates the executable filename. Conversely, the execv( ), execvp( ), and execve( ) functions specify the command-line arguments with a single parameter; as the v character in the function names suggests, the parameter is the address of a vector of pointers to command-line argument strings. The last component of the array must be NULL.

The execle( ) and execve( ) functions receive as their last parameter the address of an array of pointers to environment strings; as usual, the last component of the array must be NULL. The other functions may access the environment for the new program from the external environ global variable, which is defined in the C library.

All exec functions, with the exception of execve( ), are wrapper routines defined in the C library and use execve( ), which is the only system call offered by Linux to deal with program execution. The sys_execve( ) service routine receives the following parameters:

● The address of the executable file pathname (in the User Mode address space).
● The address of a NULL-terminated array (in the User Mode address space) of pointers to strings (again in the User Mode address space); each string represents a command-line argument.
● The address of a NULL-terminated array (in the User Mode address space) of pointers to strings (again in the User Mode address space); each string represents an environment variable in the NAME=value format.
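The three parameters just listed map directly onto the arguments of the execve( ) wrapper. As a minimal sketch, the following function forks a child, replaces the child's execution context with execve( ), and returns the exit status of the new program. The function name spawn_execve( ) and the exit code 127 used to signal a failed execve( ) are our own choices.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run the program stored in path inside a child process, passing it the
 * NULL-terminated argv and envp arrays. Returns the program's exit
 * status, 127 if execve( ) failed in the child, or -1 on other errors.
 * Hypothetical helper written for illustration only. */
int spawn_execve(const char *path, char *const argv[], char *const envp[])
{
    int status;
    pid_t child = fork();

    if (child < 0)
        return -1;
    if (child == 0) {
        execve(path, argv, envp);
        /* This line is reached only if execve( ) failed: on success the
         * code that issued the system call no longer exists. */
        _exit(127);
    }
    if (waitpid(child, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

For instance, passing "/bin/sh" as the pathname, the arguments "sh", "-c", and "exit $CODE", and an environment array containing only "CODE=7" makes the shell terminate with status 7, which shows that the new program sees exactly the environment strings it was given.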

The function copies the executable file pathname into a newly allocated page frame. It then invokes the do_execve( ) function, passing to it the pointers to the page frame, to the arrays of pointers, and to the location of the Kernel Mode stack where the User Mode register contents are saved. In turn, do_execve( ) performs the following operations:

1. Statically allocates a linux_binprm data structure, which will be filled with data concerning the new executable file.
2. Invokes path_init( ), path_walk( ), and dentry_open( ) to get the dentry object, the file object, and the inode object associated with the executable file. On failure, returns the proper error code.
3. Verifies that the executable file is not being written by checking the i_writecount field of the inode; stores -1 in that field to forbid further write accesses.
4. Invokes the prepare_binprm( ) function to fill the linux_binprm data structure. This function, in turn, performs the following operations:
   a. Checks whether the permissions of the file allow its execution; if not, returns an error code.
   b. Initializes the e_uid and e_gid fields of the linux_binprm structure, taking into account the values of the setuid and setgid flags of the executable file. These fields represent the effective user and group IDs, respectively. Also checks process capabilities (a compatibility hack explained in the earlier Section 20.1.1).
   c. Fills the buf field of the linux_binprm structure with the first 128 bytes of the executable file. These bytes include the magic number of the executable format and other information suitable for recognizing the executable file.
5. Copies the file pathname, command-line arguments, and environment strings into one or more newly allocated page frames. (Eventually, they are assigned to the User Mode address space.)
6. Invokes the search_binary_handler( ) function, which scans the formats list and tries to apply the load_binary method of each element, passing to it the linux_binprm data structure. The scan of the formats list terminates as soon as a load_binary method succeeds in acknowledging the executable format of the file.
7. If the executable file format is not present in the formats list, releases all allocated page frames and returns the error code -ENOEXEC. Linux cannot recognize the executable file format.
8. Otherwise, returns the code obtained from the load_binary method associated with the executable format of the file.

The load_binary method corresponding to an executable file format performs the following operations (we assume that the executable file is stored on a filesystem that allows file memory mapping and that it requires one or more shared libraries):

1. Checks some magic numbers stored in the first 128 bytes of the file to identify the executable format. If the magic numbers don't match, returns the error code -ENOEXEC.
2. Reads the header of the executable file. This header describes the program's segments and the shared libraries requested.
3. Gets from the executable file the pathname of the program interpreter, which is used to locate the shared libraries and map them into memory.
4. Gets the dentry object (as well as the inode object and the file object) of the program interpreter.
5. Checks the execution permissions of the program interpreter.
6. Copies the first 128 bytes of the program interpreter into a buffer.
7. Performs some consistency checks on the program interpreter type.
8. Invokes the flush_old_exec( ) function to release almost all resources used by the previous computation; in turn, this function performs the following operations:
   a. If the table of signal handlers is shared with other processes, allocates a new table and decrements the usage counter of the old one; this is done by invoking the make_private_signals( ) function.
   b. Invokes the exec_mmap( ) function to release the memory descriptor, all memory regions, and all page frames assigned to the process and to clean up the process's Page Tables.
   c. Updates the table of signal handlers by resetting each signal to its default action. This is done by invoking the release_old_signals( ) and flush_signal_handlers( ) functions.
   d. Sets the comm field of the process descriptor with the executable file pathname.
   e. Invokes the flush_thread( ) function to clear the values of the floating point registers and debug registers saved in the TSS segment.
   f. Invokes the de_thread( ) function to detach the process from the old thread group (see Section 3.2.2).
   g. Invokes the flush_old_files( ) function to close all open files having the corresponding flag in the files->close_on_exec field of the process descriptor set (see Section 12.2.6).[6]

   [6] These flags can be read and modified by means of the fcntl( ) system call.

   Now we have reached the point of no return: the function cannot restore the previous computation if something goes wrong.
9. Sets up the new personality of the process, that is, the personality field in the process descriptor.
10. Clears the PF_FORKNOEXEC flag in the process descriptor. This flag, which is set when a process is forked and cleared when it executes a new program, is required for process accounting.
11. Invokes the setup_arg_pages( ) function to allocate a new memory region descriptor for the process's User Mode stack and to insert that memory region into the process's address space. setup_arg_pages( ) also assigns the page frames containing the command-line arguments and the environment variable strings to the new memory region.
12. Invokes the do_mmap( ) function to create a new memory region that maps the text segment (that is, the code) of the executable file. The initial linear address of the memory region depends on the executable format, since the program's executable code is usually not relocatable. Therefore, the function assumes that the text segment is loaded starting from some specific logical address offset (and thus from some specified linear address). ELF programs are loaded starting from linear address 0x08048000.
13. Invokes the do_mmap( ) function to create a new memory region that maps the data segment of the executable file. Again, the initial linear address of the memory region depends on the executable format, since the executable code expects to find its variables at specified offsets (that is, at specified linear addresses). In an ELF program, the data segment is loaded right after the text segment.
14. Allocates additional memory regions for any other specialized segments of the executable file. Usually, there are none.
15. Invokes a function that loads the program interpreter. If the program interpreter is an ELF executable, the function is named load_elf_interp( ). In general, the function performs the operations in Steps 11 through 13, but for the program interpreter instead of the file to be executed. The initial addresses of the memory regions that will include the text and data of the program interpreter are specified by the program interpreter itself; however, they are very high (usually above 0x40000000) to avoid collisions with the memory regions that map the text and data of the file to be executed (see the earlier Section 20.1.4).
16. Stores in the binfmt field of the process descriptor the address of the linux_binfmt object of the executable format.
17. Determines the new capabilities of the process.
18. Creates specific program interpreter tables and stores them on the User Mode stack between the command-line arguments and the array of pointers to environment strings (see Figure 20-1).
19. Sets the values of the start_code, end_code, end_data, start_brk, brk, and start_stack fields of the process's memory descriptor.
20. Invokes the do_brk( ) function to create a new anonymous memory region mapping the bss segment of the program. (When the process writes into a variable, it triggers demand paging, and thus the allocation of a page frame.) The size of this memory region was computed when the executable program was linked. The initial linear address of the memory region must be specified, since the program's executable code is usually not relocatable. In an ELF program, the bss segment is loaded right after the data segment.
21. Invokes the start_thread( ) macro to modify the values of the User Mode registers eip and esp saved on the Kernel Mode stack, so that they point to the entry point of the program interpreter and to the top of the new User Mode stack, respectively.
22. If the process is being traced, sends the SIGTRAP signal to it.
23. Returns the value 0 (success).
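The close-on-exec flags examined by flush_old_files( ) are set from User Mode through the fcntl( ) system call. The following sketch shows the usual read-modify-write sequence; the function names set_close_on_exec( ) and is_close_on_exec( ) are our own, while F_GETFD, F_SETFD, and FD_CLOEXEC are the standard fcntl( ) commands and flag.

```c
#include <fcntl.h>

/* Turn the close-on-exec flag of file descriptor fd on or off.
 * Returns 0 on success, -1 on error. Hypothetical helper. */
int set_close_on_exec(int fd, int enable)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0)
        return -1;
    if (enable)
        flags |= FD_CLOEXEC;
    else
        flags &= ~FD_CLOEXEC;
    return fcntl(fd, F_SETFD, flags) < 0 ? -1 : 0;
}

/* Return 1 if fd will be closed by a successful exec function,
 * 0 if it will stay open, -1 on error. Hypothetical helper. */
int is_close_on_exec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0)
        return -1;
    return (flags & FD_CLOEXEC) != 0;
}
```

A descriptor created by pipe( ) starts with the flag cleared, so it is inherited by the new program unless set_close_on_exec( ) is invoked on it before the exec function is called.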

When the execve( ) system call terminates and the calling process resumes its execution in User Mode, the execution context is dramatically changed: the code that invoked the system call no longer exists. In this sense, we could say that execve( ) never returns on success. Instead, a new program to be executed is mapped in the address space of the process.

However, the new program cannot yet be executed, since the program interpreter must still take care of loading the shared libraries.[7]

[7] Things are much simpler if the executable file is statically linked, that is, if no shared library is requested. The load_binary method just maps the text, data, bss, and stack segments of the program into the process memory regions, and then sets the User Mode eip register to the entry point of the new program.

Although the program interpreter runs in User Mode, we briefly sketch out here how it operates. Its first job is to set up a basic execution context for itself, starting from the information stored by the kernel in the User Mode stack between the array of pointers to environment strings and arg_start. Then the program interpreter must examine the program to be executed to identify which shared libraries must be loaded and which functions in each shared library are effectively requested. Next, the interpreter issues several mmap( ) system calls to create memory regions mapping the pages that will hold the library functions (text and data) actually used by the program. Then the interpreter updates all references to the symbols of the shared library, according to the linear addresses of the library's memory regions. Finally, the program interpreter terminates its execution by jumping to the main entry point of the program to be executed. From now on, the process will execute the code of the executable file and of the shared libraries.
As yo u m a y h a ve n o t ice d , e xe cu t in g a p ro g ra m is a co m p le x a ct ivit y t h a t in vo lve s m a n y fa ce t s o f ke rn e l d e s ig n , s u ch a s p ro ce s s a b s t ra ct io n , m e m o ry m a n a g e m e n t , s ys t e m ca lls , a n d file s ys t e m s . It is t h e kin d o f t o p ic t h a t m a ke s yo u re a lize wh a t a m a rve lo u s p ie ce o f wo rk Lin u x is !
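The remark that execve( ) never returns on success can be observed from User Mode. The following is a minimal sketch, not kernel code; the function name exec_or_errno and the path used in the usage note are hypothetical names chosen for illustration.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Try to replace the current program image with the one stored in `path`.
 * On success, execve() does not return: the code that issued the call is
 * no longer mapped in the address space. If control reaches the statement
 * after execve(), the call has failed, so the errno value is returned.
 * (exec_or_errno is a hypothetical helper name.)
 */
int exec_or_errno(const char *path)
{
    char *const argv[] = { (char *)path, NULL };  /* argv[0] = program name */
    char *const envp[] = { NULL };                /* empty environment */

    execve(path, argv, envp);

    /* Reached only when execve() failed. */
    return errno;
}
```

Calling exec_or_errno( ) on a path that does not exist, such as "/nonexistent/hypothetical-program", should come back with ENOENT; calling it on a valid executable would never come back at all.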


Appendix A. System Startup

This appendix explains what happens right after users switch on their computers, that is, how a Linux kernel image is copied into memory and executed. In short, we discuss how the kernel, and thus the whole system, is "bootstrapped."

Traditionally, the term bootstrap refers to a person who tries to stand up by pulling his own boots. In operating systems, the term denotes bringing at least a portion of the operating system into main memory and having the processor execute it. It also denotes the initialization of kernel data structures, the creation of some user processes, and the transfer of control to one of them.

Computer bootstrapping is a tedious, long task, since initially, nearly every hardware device, including the RAM, is in a random, unpredictable state. Moreover, the bootstrap process is highly dependent on the computer architecture; as usual, we refer to IBM's PC architecture in this appendix.


A.1 Prehistoric Age: The BIOS

The moment after a computer is powered on, it is practically useless because the RAM chips contain random data and no operating system is running. To begin the boot, a special hardware circuit raises the logical value of the RESET pin of the CPU. After RESET is asserted, some registers of the processor (including cs and eip) are set to fixed values, and the code found at physical address 0xfffffff0 is executed. This address is mapped by the hardware to a certain read-only, persistent memory chip that is often called Read-Only Memory (ROM). The set of programs stored in ROM is traditionally called Basic Input/Output System (BIOS), since it includes several interrupt-driven low-level procedures used by some operating systems, including Microsoft's MS-DOS, to handle the hardware devices that make up the computer.

Once initialized, Linux does not use BIOS, but provides its own device driver for every hardware device on the computer. In fact, the BIOS procedures must be executed in Real Mode, while the kernel executes in Protected Mode (see Section 2.2), so they cannot share functions even if that would be beneficial.

The BIOS uses Real Mode addresses because they are the only ones available when the computer is turned on. A Real Mode address is composed of a seg segment and an off offset; the corresponding physical address is given by seg*16+off. As a result, no Global Descriptor Table, Local Descriptor Table, or paging table is needed by the CPU addressing circuit to translate a logical address into a physical one. Clearly, the code that initializes the GDT, LDT, and paging tables must run in Real Mode.

Linux is forced to use BIOS in the bootstrapping phase, when it must retrieve the kernel image from disk or from some other external device.

The BIOS bootstrap procedure essentially performs the following four operations:

1. Executes a series of tests on the computer hardware to establish which devices are present and whether they are working properly. This phase is often called Power-On Self-Test (POST). During this phase, several messages, such as the BIOS version banner, are displayed.

2. Initializes the hardware devices. This phase is crucial in modern PCI-based architectures, since it guarantees that all hardware devices operate without conflicts on the IRQ lines and I/O ports. At the end of this phase, a table of installed PCI devices is displayed.

3. Searches for an operating system to boot. Actually, depending on the BIOS setting, the procedure may try to access (in a predefined, customizable order) the first sector (boot sector) of any floppy disk, hard disk, and CD-ROM in the system.

4. As soon as a valid device is found, copies the contents of its first sector into RAM, starting from physical address 0x00007c00, and then jumps into that address and executes the code just loaded.

The rest of this appendix takes you from the most primitive starting state to the full glory of a running Linux system.
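The Real Mode address computation described in this section (seg*16+off) is simple enough to write down directly. The sketch below is ordinary user-level C for illustration only; the function name real_mode_to_physical is an assumption, not a kernel or BIOS routine.

```c
#include <stdint.h>

/*
 * Translate a Real Mode logical address (segment:offset) into the
 * corresponding physical address: seg * 16 + off.
 * No descriptor table or page table is consulted.
 * (real_mode_to_physical is a hypothetical name used for illustration.)
 */
uint32_t real_mode_to_physical(uint16_t seg, uint16_t off)
{
    /* Shifting left by 4 bits multiplies the segment by 16. */
    return ((uint32_t)seg << 4) + off;
}
```

Note that several segment:offset pairs name the same byte: both 0x07c0:0x0000 and 0x0000:0x7c00 translate to physical address 0x00007c00, where the BIOS places the boot sector.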


A.2 Ancient Age: The Boot Loader

The boot loader is the program invoked by the BIOS to load the image of an operating system kernel into RAM. Let's briefly sketch how boot loaders work in IBM's PC architecture.

To boot from a floppy disk, the instructions stored in its first sector are loaded in RAM and executed; these instructions copy all the remaining sectors containing the kernel image into RAM.

Booting from a hard disk is done differently. The first sector of the hard disk, named the Master Boot Record (MBR), includes the partition table [1] and a small program, which loads the first sector of the partition containing the operating system to be started. Some operating systems, such as Microsoft Windows 98, identify this partition by means of an active flag included in the partition table; [2] following this approach, only the operating system whose kernel image is stored in the active partition can be booted. As we shall see later, Linux is more flexible because it replaces the rudimentary program included in the MBR with a sophisticated program such as LILO or GRand Unified Bootloader (GRUB) that allows users to select the operating system to be booted.

[1] Each partition table entry typically includes the starting and ending sectors of a partition and the kind of operating system that handles it.

[2] The active flag may be set through programs like MS-DOS's FDISK.

A.2.1 Booting Linux from Floppy Disk

The only way to store a Linux kernel on a single floppy disk is to compress the kernel image. As we shall see, compression is done at compile time and decompression is done by the loader.

If the Linux kernel is loaded from a floppy disk, the boot loader is quite simple. It is coded in the arch/i386/boot/bootsect.S assembly language file. When a new kernel image is produced by compiling the kernel source, the executable code yielded by this assembly language file is placed at the beginning of the kernel image file. Thus, it is very easy to produce a bootable floppy by copying the Linux kernel image to the floppy disk starting from the first sector of the disk. When the BIOS loads the first sector of the floppy disk into memory, it actually copies the code of the boot loader.

The boot loader, which is invoked by the BIOS by jumping to physical address 0x00007c00, performs the following operations:

1. Moves itself from address 0x00007c00 to address 0x00090000.

2. Sets up the Real Mode stack from address 0x00003ff4. As usual, the stack grows toward lower addresses.

3. Sets up the disk parameter table, used by the BIOS to handle the floppy device driver.

4. Invokes a BIOS procedure to display a "Loading" message.

5. Invokes a BIOS procedure to load the setup( ) code of the kernel image from the floppy disk and puts it in RAM starting from address 0x00090200.

6. Invokes a BIOS procedure to load the rest of the kernel image from the floppy disk and puts the image in RAM starting from either low address 0x00010000 (for small kernel images compiled with make zImage) or high address 0x00100000 (for big kernel images compiled with make bzImage). In the following discussion, we say that the kernel image is "loaded low" or "loaded high" in RAM, respectively. Support for big kernel images uses essentially the same booting scheme as the other one, but it places data in different physical memory addresses to avoid problems with the ISA hole mentioned in Section 2.5.3.

7. Jumps to the setup( ) code.
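The physical addresses used in the steps above can be collected in one place. The following sketch only restates the layout given in the text; the constant names and the kernel_load_address( ) helper are hypothetical and do not appear in bootsect.S.

```c
#include <stdint.h>

/* Physical addresses quoted in the floppy boot steps (names are illustrative). */
enum {
    BOOTSECT_LOAD_ADDR  = 0x00007c00, /* where the BIOS puts the boot sector   */
    BOOTSECT_MOVED_ADDR = 0x00090000, /* where the boot loader relocates itself */
    SETUP_ADDR          = 0x00090200, /* where the setup( ) code is loaded      */
    KERNEL_LOW_ADDR     = 0x00010000, /* "loaded low": make zImage              */
    KERNEL_HIGH_ADDR    = 0x00100000  /* "loaded high": make bzImage            */
};

/*
 * Return the address at which the rest of the kernel image is placed.
 * big_kernel is nonzero for an image built with make bzImage.
 */
uint32_t kernel_load_address(int big_kernel)
{
    return big_kernel ? KERNEL_HIGH_ADDR : KERNEL_LOW_ADDR;
}
```

The difference SETUP_ADDR - BOOTSECT_MOVED_ADDR is 0x200 bytes, which is the size of the boot loader's own sector; the setup( ) code therefore sits immediately after the relocated boot loader in RAM.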

A.2.2 Booting Linux from Hard Disk

In most cases, the Linux kernel is loaded from a hard disk, and a two-stage boot loader is required. The most commonly used Linux boot loader on 80x86 systems is named LInux LOader (LILO); corresponding programs exist for other architectures. LILO may be installed either on the MBR (replacing the small program that loads the boot sector of the active partition) or in the boot sector of a (usually active) disk partition. In both cases, the final result is the same: when the loader is executed at boot time, the user may choose which operating system to load.

The LILO boot loader is broken into two parts, since otherwise it is too large to fit into the MBR. The MBR or the partition boot sector includes a small boot loader, which is loaded into RAM starting from address 0x00007c00 by the BIOS. This small program moves itself to the address 0x0009a000, sets up the Real Mode stack (ranging from 0x0009b000 to 0x0009a200), and loads the second part of the LILO boot loader into RAM starting from address 0x0009b000. In turn, this latter program reads a map of available operating systems from disk and offers the user a prompt so she can choose one of them. Finally, after the user has chosen the kernel to be loaded (or let a time-out elapse so that LILO chooses a default), the boot loader may either copy the boot sector of the corresponding partition into RAM and execute it or directly copy the kernel image into RAM.

Assuming that a Linux kernel image must be booted, the LILO boot loader, which relies on BIOS routines, performs essentially the same operations as the boot loader integrated into the kernel image described in the previous section about floppy disks. The loader displays the "Loading Linux" message; then it copies the integrated boot loader of the kernel image to address 0x00090000, the setup( ) code to address 0x00090200, and the rest of the kernel image to address 0x00010000 or 0x00100000. Then it jumps to the setup( ) code.


A.3 Middle Ages: The setup( ) Function

The code of the setup( ) assembly language function is placed by the linker immediately after the integrated boot loader of the kernel, that is, at offset 0x200 of the kernel image file. The boot loader can therefore easily locate the code and copy it into RAM, starting from physical address 0x00090200.

The setup( ) function must initialize the hardware devices in the computer and set up the environment for the execution of the kernel program. Although the BIOS already initialized most hardware devices, Linux does not rely on it, but reinitializes the devices in its own manner to enhance portability and robustness. setup( ) performs the following operations:

1. Invokes a BIOS procedure to find out the amount of RAM available in the system.

2. Sets the keyboard repeat delay and rate. (When the user keeps a key pressed past a certain amount of time, the keyboard device sends the corresponding keycode over and over to the CPU.)

3. Initializes the video adapter card.

4. Reinitializes the disk controller and determines the hard disk parameters.

5. Checks for an IBM Micro Channel bus (MCA).

6. Checks for a PS/2 pointing device (bus mouse).

7. Checks for Advanced Power Management (APM) BIOS support.

8. If the kernel image was loaded low in RAM (at physical address 0x00010000), moves it to physical address 0x00001000. Conversely, if the kernel image was loaded high in RAM, the function does not move it. This step is necessary because, to be able to store the kernel image on a floppy disk and to save time while booting, the kernel image stored on disk is compressed, and the decompression routine needs some free space to use as a temporary buffer following the kernel image in RAM.

9. Sets up a provisional Interrupt Descriptor Table (IDT) and a provisional Global Descriptor Table (GDT).

10. Resets the floating-point unit (FPU), if any.

11. Reprograms the Programmable Interrupt Controller (PIC) and maps the 16 hardware interrupts (IRQ lines) to the range of vectors from 32 to 47. The kernel must perform this step because the BIOS erroneously maps the hardware interrupts in the range from 0 to 15, which is already used for CPU exceptions (see Section 4.2.2).

12. Switches the CPU from Real Mode to Protected Mode by setting the PE bit in the cr0 status register. The PG bit in the cr0 register is cleared, so paging is still disabled.

13. Jumps to the startup_32( ) assembly language function.
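Steps 11 and 12 above boil down to a small amount of arithmetic, which the following sketch illustrates in plain C. It only computes values; the real setup( ) code is Real Mode assembly that programs the PIC through I/O ports and writes cr0 with privileged instructions. The macro and function names are assumptions made for this example; the bit positions of PE (bit 0) and PG (bit 31) in cr0 are those defined by the 80x86 architecture.

```c
#include <stdint.h>

#define FIRST_IRQ_VECTOR 32u        /* vector assigned to IRQ0 after remapping */
#define NR_PIC_IRQS      16u        /* IRQ lines handled by the PIC            */
#define CR0_PE           (1u << 0)  /* Protection Enable bit of cr0            */
#define CR0_PG           (1u << 31) /* Paging bit of cr0                       */

/* Map an IRQ line (0..15) to its interrupt vector (32..47); -1 if invalid. */
int irq_to_vector(unsigned int irq)
{
    if (irq >= NR_PIC_IRQS)
        return -1;
    return (int)(FIRST_IRQ_VECTOR + irq);
}

/*
 * Compute the cr0 value that enters Protected Mode with paging disabled:
 * the PE bit is set and the PG bit is cleared.
 */
uint32_t cr0_for_protected_mode(uint32_t cr0)
{
    return (cr0 | CR0_PE) & ~CR0_PG;
}
```

With this mapping, IRQ0 is delivered on vector 32 and IRQ15 on vector 47, leaving vectors 0 through 31 free for CPU exceptions.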


A.4 Renaissance: The startup_32( ) Functions

There are two different startup_32( ) functions; the one we refer to here is coded in the arch/i386/boot/compressed/head.S file. After setup( ) terminates, the function has been moved either to physical address 0x00100000 or to physical address 0x00001000, depending on whether the kernel image was loaded high or low in RAM.

This function performs the following operations:

1. Initializes the segmentation registers and a provisional stack.

2. Fills the area of uninitialized data of the kernel identified by the _edata and _end symbols with zeros (see Section 2.5.3).

3. Invokes the decompress_kernel( ) function to decompress the kernel image. The "Uncompressing Linux . . . " message is displayed first. After the kernel image is decompressed, the "OK, booting the kernel." message is shown. If the kernel image was loaded low, the decompressed kernel is placed at physical address 0x00100000. Otherwise, if the kernel image was loaded high, the decompressed kernel is placed in a temporary buffer located after the compressed image. The decompressed image is then moved into its final position, which starts at physical address 0x00100000.

4. Jumps to physical address 0x00100000.

The decompressed kernel image begins with another startup_32( ) function included in the arch/i386/kernel/head.S file. Using the same name for both the functions does not create any problems (besides confusing our readers), since both functions are executed by jumping to their initial physical addresses.

The second startup_32( ) function sets up the execution environment for the first Linux process (process 0). The function performs the following operations:

1. Initializes the segmentation registers with their final values.

2. Sets up the Kernel Mode stack for process 0 (see Section 3.4.2).

3. Initializes the provisional kernel Page Tables contained in swapper_pg_dir and pg0 to identically map the linear addresses to the same physical addresses, as explained in Section 2.5.5.

4. Stores the address of the Page Global Directory in the cr3 register, and enables paging by setting the PG bit in the cr0 register.

5. Fills the bss segment of the kernel (see Section 20.1.4) with zeros.

6. Invokes setup_idt( ) to fill the IDT with null interrupt handlers (see Section 4.4.2).

7. Puts the system parameters obtained from the BIOS and the parameters passed to the operating system into the first page frame (see Section 2.5.3).

8. Identifies the model of the processor.

9. Loads the gdtr and idtr registers with the addresses of the GDT and IDT tables.

10. Jumps to the start_kernel( ) function.
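The identity mapping set up in step 3 of the second startup_32( ) function can be pictured as follows. This is only a sketch in C under stated assumptions: the real code is assembly in head.S, the helper names are hypothetical, and the flag value 0x007 (Present, Read/Write, User) is chosen here for illustration. The page size of 4,096 bytes and the 1,024 entries per Page Table are those of 80x86 regular paging.

```c
#include <stdint.h>

#define PAGE_SIZE      4096u   /* bytes per page frame                    */
#define PTRS_PER_TABLE 1024u   /* entries in one 80x86 Page Table         */
#define PTE_FLAGS      0x007u  /* Present | Read/Write | User (assumed)   */

/* Build the entry that maps page number `index` onto the same page frame. */
uint32_t identity_pte(uint32_t index)
{
    return (index * PAGE_SIZE) | PTE_FLAGS;
}

/* Extract the physical address of the page frame from a Page Table entry. */
uint32_t pte_to_phys(uint32_t pte)
{
    return pte & ~(PAGE_SIZE - 1u);
}

/* Fill a whole Page Table so that linear address == physical address. */
void fill_identity_table(uint32_t table[PTRS_PER_TABLE])
{
    for (uint32_t i = 0; i < PTRS_PER_TABLE; i++)
        table[i] = identity_pte(i);
}
```

One table filled this way covers 1,024 x 4,096 bytes, that is, the first 4 MB of physical memory; once the PG bit in cr0 is set, code running in that range keeps executing at the same addresses.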


A.5 Modern Age: The start_kernel( ) Function

The start_kernel( ) function completes the initialization of the Linux kernel. Nearly every kernel component is initialized by this function; we mention just a few of them:

● The Page Tables are initialized by invoking the paging_init( ) function (see Section 2.5.5).

● The page descriptors are initialized by the kmem_init( ), free_area_init( ), and mem_init( ) functions (see Section 7.1.4).

● The final initialization of the IDT is performed by invoking trap_init( ) (see Section 4.5) and init_IRQ( ) (see Section 4.6.1.2).

● The slab allocator is initialized by the kmem_cache_init( ) and kmem_cache_sizes_init( ) functions (see Section 7.2.4).

● The system date and time are initialized by the time_init( ) function (see Section 6.1.1).

● The kernel thread for process 1 is created by invoking the kernel_thread( ) function. In turn, this kernel thread creates the other kernel threads and executes the /sbin/init program, as described in Section 3.4.2 in Chapter 3.

Besides the "Linux version 2.4.18 . . . " message, which is displayed right after the beginning of start_kernel( ), many other messages are displayed in this last phase, both by the init functions and by the kernel threads. At the end, the familiar login prompt appears on the console (or in the graphical screen, if the X Window System is launched at startup), telling the user that the Linux kernel is up and running.


Appendix B. Modules

As stated in Chapter 1, modules are Linux's recipe for effectively achieving many of the theoretical advantages of microkernels without introducing performance penalties.


B.1 To Be (a Module) or Not to Be?

When system programmers want to add new functionality to the Linux kernel, they are faced with a basic decision: should they write the new code so that it will be compiled as a module, or should they statically link the new code to the kernel?

As a general rule, system programmers tend to implement new code as a module. Because modules can be linked on demand (as we see later), the kernel does not have to be bloated with hundreds of seldom-used programs. Nearly every higher-level component of the Linux kernel (filesystems, device drivers, executable formats, network layers, and so on) can be compiled as a module.

However, some Linux code must necessarily be linked statically, which means that either the corresponding component is included in the kernel or it is not compiled at all. This happens typically when the component requires a modification to some data structure or function statically linked in the kernel.

For example, suppose the component has to introduce new fields into the process descriptor. Linking a module cannot change an already defined data structure like task_struct since, even if the module uses its modified version of the data structure, all statically linked code continues to see the old version. Data corruption easily occurs. A partial solution to the problem consists of "statically" adding the new fields to the process descriptor, thus making them available to the kernel component no matter how it has been linked. However, if the kernel component is never used, such extra fields replicated in every process descriptor are a waste of memory. If the new kernel component increases the size of the process descriptor a lot, one would get better system performance by adding the required fields in the data structure only if the component is statically linked to the kernel.

As a second example, consider a kernel component that has to replace statically linked code. It's pretty clear that no such component can be compiled as a module because the kernel cannot change the machine code already in RAM when linking the module. For instance, it is not possible to link a module that changes the way page frames are allocated, since the Buddy system functions are always statically linked to the kernel. [1]

[1] You might wonder why not all kernel components have been modularized. Actually, there is no strong technical reason because it is essentially a software license issue. Kernel developers want to make sure that core components will never be replaced by proprietary code released through binary-only "blackbox" modules.

The kernel has two key tasks to perform in managing modules. The first task is making sure the rest of the kernel can reach the module's global symbols, such as the entry point to its main function. A module must also know the addresses of symbols in the kernel and in other modules. Thus, references are resolved once and for all when a module is linked. The second task consists of keeping track of the use of modules, so that no module is unloaded while another module or another part of the kernel is using it. A simple reference count keeps track of each module's usage.
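The second task, usage tracking through a reference count, can be sketched in a few lines of C. This is a user-level illustration of the idea only; the structure and function names (mod_usage, mod_get, mod_put, mod_try_unload, usage_demo) are hypothetical, and the sketch ignores the atomicity and locking a real kernel needs.

```c
/* Hypothetical, simplified stand-in for a module's usage bookkeeping. */
struct mod_usage {
    int use_count;  /* number of current users of the module */
    int loaded;     /* nonzero while the module is linked    */
};

/* A new user starts relying on the module. */
void mod_get(struct mod_usage *m)
{
    m->use_count++;
}

/* A user is done with the module. */
void mod_put(struct mod_usage *m)
{
    if (m->use_count > 0)
        m->use_count--;
}

/* Unload only if nobody is using the module; return 0 on success, -1 if busy. */
int mod_try_unload(struct mod_usage *m)
{
    if (m->use_count != 0)
        return -1;
    m->loaded = 0;
    return 0;
}

/* Usage example: unloading fails while in use, then succeeds after release. */
int usage_demo(void)
{
    struct mod_usage m = { 0, 1 };

    mod_get(&m);
    if (mod_try_unload(&m) == 0)
        return 1;               /* would be a bug: unloaded while in use */
    mod_put(&m);
    return mod_try_unload(&m);  /* 0: the module is now unused */
}
```

The design point is that unloading is refused, rather than deferred, whenever the counter is nonzero; the caller has to retry once all users have released the module.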


B.2 Module Implementation

Modules are stored in the filesystem as ELF object files and are linked to the kernel by executing the insmod program (see the later section, Section B.3). For each module, the kernel allocates a memory area containing the following data:

● A module object

● A null-terminated string that represents the name of the module (all modules should have unique names)

● The code that implements the functions of the module

The module object describes a module; its fields are shown in Table B-1. A singly linked list collects all module objects, where the next field of each object points to the next element in the list. The first element of the list is addressed by the module_list variable. But actually, the first element of the list is always the same: it is named kernel_module and refers to a fictitious module representing the statically linked kernel code.

Table B-1. The module object

Type                            Name            Description
unsigned long                   size_of_struct  Size of module object
struct module *                 next            Next list element
const char *                    name            Pointer to module name
unsigned long                   size            Module size
atomic_t                        uc.usecount     Module usage counter
unsigned long                   flags           Module flags
unsigned int                    nsyms           Number of exported symbols
unsigned int                    ndeps           Number of referenced modules
struct module_symbol *          syms            Table of exported symbols
struct module_ref *             deps            List of referenced modules
struct module_ref *             refs            List of referencing modules
int (*)(void)                   init            Initialization method
void (*)(void)                  cleanup         Cleanup method
struct exception_table_entry *  ex_table_start  Start of exception table
struct exception_table_entry *  ex_table_end    End of exception table
struct module_persist *         persist_start   Start of area containing module's persistent data
struct module_persist *         persist_end     End of area containing module's persistent data
int (*)(void)                   can_unload      Return 1 if the module is currently unused
int                             runsize         Not used
char *                          kallsyms_start  Start of area storing kernel symbols for debugging
char *                          kallsyms_end    End of area storing kernel symbols for debugging
char *                          archdata_start  Start of architecture-dependent data area
char *                          archdata_end    End of architecture-dependent data area
char *                          kernel_data     Not used

The total size of the memory area allocated for the module (including the module object and the module name) is contained in the size field.

As already mentioned in Section 9.2.6 in Chapter 9, each module has its own exception table. The table includes the addresses of the fixup code of the module, if any. The table is copied into RAM when the module is linked, and its starting and ending addresses are stored in the ex_table_start and ex_table_end fields of the module object.

The fields below ex_table_end were introduced in Linux 2.4 and implement some advanced features of modules. For instance, it is now possible to record in a disk file data that should be preserved across loading and unloading of a module. New module support also offers a lot of debugging data to kernel debuggers, so catching a bug hidden in the code of a module is now a lot easier.
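As a rough illustration of how a list of module objects can be scanned by name, here is a minimal user-space sketch. The struct module_sketch type and the find_module_sketch( ) function are hypothetical names invented for this example; only a handful of the fields in Table B-1 are kept, and plain C types stand in for the kernel's own.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for the module object: only a few fields from
 * Table B-1 are kept, and the types are plain C, not kernel types. */
struct module_sketch {
    unsigned long size_of_struct;   /* size of this object */
    struct module_sketch *next;     /* next list element */
    const char *name;               /* pointer to module name */
    unsigned long size;             /* total size of the memory area */
    long usecount;                  /* stands in for uc.usecount */
};

/* Walk the singly linked list starting at head and return the object
 * whose name matches, or NULL if no such module is in the list. */
struct module_sketch *find_module_sketch(struct module_sketch *head,
                                         const char *name)
{
    struct module_sketch *mod;

    assert(name != NULL);
    for (mod = head; mod != NULL; mod = mod->next)
        if (strcmp(mod->name, name) == 0)
            return mod;
    return NULL;
}
```

Because module names are expected to be unique, the first match is the only match, so the walk can stop as soon as it finds one.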

B.2.1 Module Usage Counter

Each module has a usage counter, stored in the uc.usecount field of the corresponding module object. The counter is incremented when an operation involving the module's functions is started and decremented when the operation terminates. A module can be unlinked only if its usage counter is 0.

For example, suppose that the MS-DOS filesystem layer is compiled as a module and the module is linked at runtime. Initially, the module usage counter is 0. If the user mounts an MS-DOS floppy disk, the module usage counter is incremented by 1. Conversely, when the user unmounts the floppy disk, the counter is decremented by 1.

Besides this simple mechanism, Linux 2.4's modules may also define a custom function whose address is stored in the can_unload field of the module object. The function is invoked when the module is being unlinked; it should check whether it is really safe to unload the module, and return 0 or 1 accordingly. If the function returns 0, the unloading operation is aborted, much as if the usage counter were not equal to 0.
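The interplay between the usage counter and the optional can_unload check can be sketched in user space as follows. This is only a model of the behavior described in the text, not kernel code: struct mod_usage, mod_get( ), mod_put( ), may_unload( ), and never_safe( ) are hypothetical names, and a plain long replaces the atomic counter.

```c
#include <assert.h>
#include <stddef.h>

struct mod_usage {
    long usecount;              /* stands in for uc.usecount */
    int (*can_unload)(void);    /* optional custom check; may be NULL */
};

/* Called when an operation involving the module starts (e.g., a mount). */
void mod_get(struct mod_usage *m)
{
    assert(m != NULL);
    m->usecount++;
}

/* Called when that operation terminates (e.g., an unmount). */
void mod_put(struct mod_usage *m)
{
    assert(m != NULL);
    m->usecount--;
}

/* Returns 1 if unloading may proceed, 0 if it must be aborted.
 * Follows the text: when can_unload is defined, a return value of 0
 * aborts the unloading; otherwise the usage counter must be 0. */
int may_unload(const struct mod_usage *m)
{
    assert(m != NULL);
    if (m->can_unload != NULL)
        return m->can_unload() != 0;
    return m->usecount == 0;
}

/* Hypothetical custom check that always reports "not safe to unload". */
int never_safe(void)
{
    return 0;
}
```

In the MS-DOS example, mounting a floppy would correspond to mod_get( ) and unmounting it to mod_put( ); may_unload( ) returns 1 only between the two when no custom check vetoes the operation.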

B.2.2 Exporting Symbols

When linking a module, all references to global kernel symbols (variables and functions) in the module's object code must be replaced with suitable addresses. This operation, which is very similar to that performed by the linker while compiling a User Mode program (see Section 20.1.3 in Chapter 20), is delegated to the insmod external program (described later in Section B.3).

A special table is used by the kernel to store the symbols that can be accessed by modules together with their corresponding addresses. This kernel symbol table is contained in the __ksymtab section of the kernel code segment, and its starting and ending addresses are identified by two symbols produced by the C compiler: __start___ksymtab and __stop___ksymtab. The EXPORT_SYMBOL macro, when used inside the statically linked kernel code, forces the C compiler to add a specified symbol to the table.

Only the kernel symbols actually used by some existing module are included in the table. Should a system programmer need, within some module, to access a kernel symbol that is not already exported, he can simply add the corresponding EXPORT_SYMBOL macro into the kernel/ksyms.c file of the Linux source code.

Linked modules can also export their own symbols so that other modules can access them. The module symbol table is contained in the __ksymtab section of the module code segment. If the module source code includes the EXPORT_NO_SYMBOLS macro, symbols from that module are not added to the table. To export a subset of symbols from the module, the programmer must define the EXPORT_SYMTAB macro before including the include/linux/module.h header file. Then he may use the EXPORT_SYMBOL macro to export a specific symbol. If neither EXPORT_NO_SYMBOLS nor EXPORT_SYMTAB appears in the module source code, all nonstatic global symbols of the module are exported.

The symbol table in the __ksymtab section is copied into a memory area when the module is linked, and the address of the area is stored in the syms field of the module object. The symbols exported by the statically linked kernel and all linked-in modules can be retrieved by reading the /proc/ksyms file or using the query_module( ) system call (described later in Section B.3).

Recently, a new EXPORT_SYMBOL_GPL macro was added. It is functionally equivalent to EXPORT_SYMBOL, but it marks the exported symbol as usable only in modules licensed under either the General Public License (GPL) or a compatible license. This way, the author of a kernel component may forbid using his work in binary-only modules that do not comply with the standard requirements of the GPL. The license of a module is specified by a MODULE_LICENSE macro inserted into its source code, whose argument is usually "GPL" or "Proprietary."
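To make the idea of a symbol table lookup concrete, the following user-space sketch resolves a symbol name to an address and refuses GPL-only symbols to non-GPL requesters. Everything here is an assumption made for illustration: struct sym_entry, its gpl_only flag, and resolve_symbol( ) are hypothetical, and the sketch does not claim to show how the kernel actually records the GPL-only marking.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One entry of a simplified symbol table. */
struct sym_entry {
    const char *name;       /* exported symbol name */
    unsigned long value;    /* address of the symbol */
    int gpl_only;           /* 1 if modelled as exported via EXPORT_SYMBOL_GPL */
};

/* Look up name in a table of nsyms entries. Returns 1 and stores the
 * address in *value on success. Returns 0 if the symbol is not in the
 * table, or if it is GPL-only and the requester is not GPL-compatible. */
int resolve_symbol(const struct sym_entry *tab, size_t nsyms,
                   const char *name, int requester_is_gpl,
                   unsigned long *value)
{
    size_t i;

    assert(name != NULL && value != NULL);
    for (i = 0; i < nsyms; i++) {
        if (strcmp(tab[i].name, name) != 0)
            continue;
        if (tab[i].gpl_only && !requester_is_gpl)
            return 0;
        *value = tab[i].value;
        return 1;
    }
    return 0;
}
```

A linear scan keeps the sketch short; the point is only that an unexported name, or a GPL-only name requested by a non-GPL module, leaves the reference unresolved.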

B.2.3 Module Dependency

A module (B) can refer to the symbols exported by another module (A); in this case, we say that B is loaded on top of A, or equivalently that A is used by B. To link module B, module A must have already been linked; otherwise, the references to the symbols exported by A cannot be properly linked in B. In short, there is a dependency between modules.

The deps field of the module object of B points to a list describing all modules that are used by B; in our example, A's module object would appear in that list. The ndeps field stores the number of modules used by B. Conversely, the refs field of A points to a list describing all modules that are loaded on top of A (thus, B's module object is included when it is loaded). The refs list must be updated dynamically whenever a module is loaded on top of A. To ensure that module A is not removed before B, A's usage counter is incremented for each module loaded on top of it.

Besides A and B there could be, of course, another module (C) loaded on top of B, and so on. Stacking modules is an effective way to modularize the kernel source code to speed up its development and improve its portability.
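The bookkeeping between deps, refs, and the usage counter can be modelled with a short user-space sketch. The names struct dep_module, load_on_top( ), and removable( ) are hypothetical, and fixed-size arrays are used in place of the kernel's lists purely to keep the example small.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LINKS 8   /* arbitrary bound chosen for this sketch */

struct dep_module {
    const char *name;
    int usecount;                        /* usage counter */
    int ndeps;                           /* number of modules this one uses */
    struct dep_module *deps[MAX_LINKS];  /* modules used by this module */
    int nrefs;                           /* number of modules on top of this one */
    struct dep_module *refs[MAX_LINKS];  /* modules loaded on top of this one */
};

/* Record that upper is loaded on top of lower.
 * Returns 0 on success, -1 if either array is full. */
int load_on_top(struct dep_module *upper, struct dep_module *lower)
{
    assert(upper != NULL && lower != NULL);
    if (upper->ndeps == MAX_LINKS || lower->nrefs == MAX_LINKS)
        return -1;
    upper->deps[upper->ndeps++] = lower;
    lower->refs[lower->nrefs++] = upper;
    lower->usecount++;   /* keeps lower from being removed before upper */
    return 0;
}

/* A module may be removed only if nothing is loaded on top of it
 * and its usage counter is 0. */
int removable(const struct dep_module *m)
{
    assert(m != NULL);
    return m->nrefs == 0 && m->usecount == 0;
}
```

With A and B as in the text, calling load_on_top(&b, &a) makes A non-removable while B stays removable, which is the ordering constraint the usage counter is meant to enforce.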


B.3 Linking and Unlinking Modules

A user can link a module into the running kernel by executing the insmod external program. This program performs the following operations:

1. Reads from the command line the name of the module to be linked.

2. Locates the file containing the module's object code in the system directory tree. The file is usually placed in some subdirectory below /lib/modules.

3. Computes the size of the memory area needed to store the module code, its name, and the module object.

4. Invokes the create_module( ) system call, passing to it the name and size of the new module. The corresponding sys_create_module( ) service routine performs the following operations:

   a. Checks whether the user is allowed to link the module (the current process must have the CAP_SYS_MODULE capability). In any situation where one is adding functionality to a kernel, which has access to all data and processes on the system, security is a paramount concern.

   b. Invokes the find_module( ) function to scan the module_list list of module objects looking for a module with the specified name. If it is found, the module has already been linked, so the system call terminates.

   c. Invokes vmalloc( ) to allocate a memory area for the new module.

   d. Initializes the fields of the module object at the beginning of the memory area and copies the name of the module right below the object.

   e. Inserts the module object into the list pointed to by module_list.

   f. Returns the starting address of the memory area allocated to the module.

5. Invokes the query_module( ) system call with the QM_MODULES subcommand to get the names of all already linked modules.

6. Invokes the query_module( ) system call with the QM_INFO subcommand repeatedly, to get the starting address and the size of all modules that are already linked in.

7. Invokes the query_module( ) system call with the QM_SYMBOLS subcommand repeatedly, to get the kernel symbol table and the symbol tables of all modules that are already linked in.

8. Using the kernel symbol table, the module symbol tables, and the address returned by the create_module( ) system call, the program relocates the object code included in the module's file. This means replacing all occurrences of external and global symbols with the corresponding logical address offsets.

9. Allocates a memory area in the User Mode address space and loads it with a copy of the module object, the module's name, and the module's code relocated for the running kernel. The address fields of the object point to the relocated code. In particular, the init field is set to the relocated address of the module's function named init_module( ), or equivalently to the address of the module's function marked by the module_init macro. Similarly, the cleanup field is set to the relocated address of the module's cleanup_module( ) function or, equivalently, to the address of the module's function marked by the module_exit macro. Each module should implement these two functions.

10. Invokes the init_module( ) system call, passing to it the address of the User Mode memory area set up in the previous step. The sys_init_module( ) service routine performs the following operations:

   a. Checks whether the user is allowed to link the module (the current process must have the CAP_SYS_MODULE capability).

   b. Invokes find_module( ) to find the proper module object in the list to which module_list points.

   c. Overwrites the module object with the contents of the corresponding object in the User Mode memory area.

   d. Performs a series of sanity checks on the addresses in the module object.

   e. Copies the remaining part of the User Mode memory area into the memory area allocated to the module.

   f. Scans the module list and initializes the ndeps and deps fields of the module object.

   g. Sets the module usage counter to 1.

   h. Executes the init method of the module to initialize the module's data structures properly.

   i. Sets the module usage counter to 0 and returns.

11. Releases the User Mode memory area and terminates.

To unlink a module, a user invokes the rmmod external program, which performs the following operations:

1. From the command line, reads the name of the module to be unlinked.

2. Invokes the query_module( ) system call with the QM_MODULES subcommand to get the list of linked modules.

3. Invokes the query_module( ) system call with the QM_SYMBOLS subcommand repeatedly, to get the kernel symbol table and the symbol tables of all modules that are already linked in.

4. If the option -r has been passed to rmmod, invokes the query_module( ) system call with the QM_REFS subcommand several times to retrieve dependency information on the linked modules.

5. Builds a list of modules to be unloaded: if the option -r has not been specified, the list includes only the module passed as argument to rmmod; otherwise, it includes the module passed as argument and all loaded modules that ultimately depend on it.

6. Invokes the delete_module( ) system call, passing the name of a module to be unloaded. The corresponding sys_delete_module( ) service routine performs these operations:

   a. Checks whether the user is allowed to remove the module (the current process must have the CAP_SYS_MODULE capability).

   b. Invokes find_module( ) to find the corresponding module object in the list to which module_list points.

   c. Checks whether the refs field is null; otherwise, returns an error code.

   d. If defined, invokes the can_unload method; otherwise, checks whether the uc.usecount field of the module object is 0. If the module is busy, returns an error code.

   e. If defined, invokes the cleanup method to perform the operations needed to cleanly shut down the module. The method is usually implemented by the cleanup_module( ) function defined inside the module.

   f. Scans the deps list of the module and removes the module from the refs list of any element found.

   g. Removes the module from the list to which module_list points.

   h. Invokes vfree( ) to release the memory area used by the module and returns 0 (success).

7. If the list of modules to be unloaded is not empty, jumps back to Step 6; otherwise, terminates.
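The size computation in Step 3 of the insmod procedure, and the layout set up in Step 4d, can be sketched in user space as shown below. This is a simplification under stated assumptions: struct mod_header, module_area_size( ), and create_module_sketch( ) are hypothetical names, malloc( ) stands in for vmalloc( ), and alignment padding between the parts of the area is ignored.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Reduced module object holding only what the layout needs. */
struct mod_header {
    unsigned long size_of_struct;  /* size of this object */
    const char *name;              /* points at the name stored after it */
    unsigned long size;            /* total size of the memory area */
};

/* Size of the area: module object, null-terminated name, then code. */
unsigned long module_area_size(const char *name, unsigned long code_size)
{
    assert(name != NULL);
    return sizeof(struct mod_header) + strlen(name) + 1 + code_size;
}

/* Allocate the area, initialize the object at its beginning, and copy
 * the name right after the object. Returns NULL if size is too small
 * to hold the object and the name, or if allocation fails. The caller
 * releases the area with free(). */
struct mod_header *create_module_sketch(const char *name, unsigned long size)
{
    struct mod_header *mod;
    char *name_area;

    assert(name != NULL);
    if (size < sizeof(struct mod_header) + strlen(name) + 1)
        return NULL;
    mod = calloc(1, size);
    if (mod == NULL)
        return NULL;
    name_area = (char *)(mod + 1);   /* first byte past the object */
    strcpy(name_area, name);
    mod->size_of_struct = sizeof(*mod);
    mod->name = name_area;
    mod->size = size;
    return mod;
}
```

The space left after the name is where the relocated code would later be copied, which is why the size has to be known before the area is allocated.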


B.4 Linking Modules on Demand

A module can be automatically linked when the functionality it provides is requested and automatically removed afterward.

For instance, suppose that the MS-DOS filesystem has not been linked, either statically or dynamically. If a user tries to mount an MS-DOS filesystem, the mount( ) system call normally fails by returning an error code, since MS-DOS is not included in the file_systems list of registered filesystems. However, if support for automatic linking of modules has been specified when configuring the kernel, Linux makes an attempt to link the MS-DOS module, and then scans the list of registered filesystems again. If the module is successfully linked, the mount( ) system call can continue its execution as if the MS-DOS filesystem were present from the beginning.

B.4.1 The modprobe Program

To automatically link a module, the kernel creates a kernel thread to execute the modprobe external program,[2] which takes care of possible complications due to module dependencies. The dependencies were discussed earlier: a module may require one or more other modules, and these in turn may require still other modules. For instance, the MS-DOS module requires another module named fat containing some code common to all filesystems based on a File Allocation Table (FAT). Thus, if it is not already present, the fat module must also be automatically linked into the running kernel when the MS-DOS module is requested. Resolving dependencies and finding modules is a type of activity that's best done in User Mode because it requires locating and accessing module object files in the filesystem.

[2] This is one of the few examples in which the kernel relies on an external program.

The modprobe external program is similar to insmod, since it links in a module specified on the command line. However, modprobe also recursively links in all modules used by the module specified on the command line. For instance, if a user invokes modprobe to link the MS-DOS module, the program links the fat module, if necessary, followed by the MS-DOS module. Actually, modprobe just checks for module dependencies; the actual linking of each module is done by forking a new process and executing insmod.

How does modprobe know about module dependencies? Another external program named depmod is executed at system startup. It looks at all the modules compiled for the running kernel, which are usually stored inside the /lib/modules directory. Then it writes all module dependencies to a file named modules.dep. The modprobe program can thus simply compare the information stored in the file with the list of linked modules produced by the query_module( ) system call.
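The recursive, dependencies-first ordering described above can be sketched as a small depth-first traversal. This is not modprobe's code: struct dep_entry, lookup_entry( ), and plan_links( ) are hypothetical names, and an in-memory table stands in for both the contents of modules.dep and the list of already linked modules.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_DEPS 4   /* arbitrary bound chosen for this sketch */

struct dep_entry {
    const char *name;
    const char *deps[MAX_DEPS];  /* used modules; unused slots are NULL */
    int linked;                  /* 1 if already linked (or already planned) */
};

static struct dep_entry *lookup_entry(struct dep_entry *tab, size_t n,
                                      const char *name)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (strcmp(tab[i].name, name) == 0)
            return &tab[i];
    return NULL;
}

/* Append to order[] the modules that must be linked to satisfy name,
 * dependencies first. count is the number of entries already in order[]
 * and max its capacity. Returns the new count, or -1 if a module is
 * unknown or order[] is full. Entries are marked as linked as they are
 * planned, even if a later step fails. */
int plan_links(struct dep_entry *tab, size_t n, const char *name,
               const char **order, int count, int max)
{
    struct dep_entry *e;
    int i;

    assert(name != NULL && order != NULL);
    e = lookup_entry(tab, n, name);
    if (e == NULL)
        return -1;
    if (e->linked)
        return count;            /* nothing to do for this module */
    e->linked = 1;               /* mark early so it is planned only once */
    for (i = 0; i < MAX_DEPS && e->deps[i] != NULL; i++) {
        count = plan_links(tab, n, e->deps[i], order, count, max);
        if (count < 0)
            return -1;
    }
    if (count >= max)
        return -1;
    order[count++] = e->name;
    return count;
}
```

For the example in the text, a table in which msdos lists fat as a dependency yields the order fat, msdos; a real tool would then run insmod once per entry in that order.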

B.4.2 The request_module( ) Function

In some cases, the kernel may invoke the request_module( ) function to attempt automatic linking for a module.

Consider again the case of a user trying to mount an MS-DOS filesystem. If the get_fs_type( ) function discovers that the filesystem is not registered, it invokes the request_module( ) function in the hope that MS-DOS has been compiled as a module.

If the request_module( ) function succeeds in linking the requested module, get_fs_type( ) can continue as if the module were always present. Of course, this does not always happen; in our example, the MS-DOS module might not have been compiled at all. In this case, get_fs_type( ) returns an error code.

The request_module( ) function receives the name of the module to be linked as its parameter. It invokes kernel_thread( ) to create a new kernel thread that executes the exec_modprobe( ) function. Then it simply waits until that kernel thread terminates.

The exec_modprobe( ) function, in turn, also receives the name of the module to be linked as its parameter. It invokes the execve( ) system call and executes the modprobe external program,[3] passing the module name to it. In turn, the modprobe program actually links the requested module, along with any that it depends on.

[3] The name and path of the program executed by exec_modprobe( ) can be customized by writing into the /proc/sys/kernel/modprobe file.

Each module automatically linked into the kernel has the MOD_AUTOCLEAN flag in the flags field of the module object set. This flag allows automatic unlinking of the module when it is no longer used.

To automatically unlink the module, a system process (like crond) periodically executes the rmmod external program, passing the -a option to it. The latter program executes the delete_module( ) system call with a NULL parameter. The corresponding service routine scans the list of module objects and removes all unused modules having the MOD_AUTOCLEAN flag set.
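The "look up, request the module, look up again" pattern that this section attributes to get_fs_type( ) can be sketched in user space as follows. All names here (struct fs_registry, loader_fn, get_fs_type_sketch( ), demo_loader( )) are hypothetical; the loader callback merely stands in for request_module( ) and registers the filesystem directly instead of running an external program.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_FS 8   /* arbitrary bound chosen for this sketch */

/* Simplified list of registered filesystem names. */
struct fs_registry {
    const char *names[MAX_FS];
    int count;
};

/* Stand-in for request_module(): returns 0 on success, nonzero on failure. */
typedef int (*loader_fn)(struct fs_registry *reg, const char *name);

static int is_registered(const struct fs_registry *reg, const char *name)
{
    int i;

    for (i = 0; i < reg->count; i++)
        if (strcmp(reg->names[i], name) == 0)
            return 1;
    return 0;
}

/* Returns 1 if the filesystem type is available, possibly after an
 * on-demand load, and 0 otherwise. */
int get_fs_type_sketch(struct fs_registry *reg, const char *name,
                       loader_fn load)
{
    assert(reg != NULL && name != NULL);
    if (is_registered(reg, name))
        return 1;
    if (load == NULL || load(reg, name) != 0)
        return 0;                     /* no loader, or the load failed */
    return is_registered(reg, name);  /* scan the list again */
}

/* Demo loader that pretends only "msdos" was compiled as a module. */
int demo_loader(struct fs_registry *reg, const char *name)
{
    if (strcmp(name, "msdos") != 0 || reg->count == MAX_FS)
        return -1;
    reg->names[reg->count++] = "msdos";
    return 0;
}
```

Passing NULL as the loader models a kernel configured without automatic module linking, in which case a missing filesystem simply produces a failure.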


Appendix C. Source Code Structure

To help you to find your way through the files of the source code, we briefly describe the organization of the kernel directory tree. As usual, all pathnames refer to the main directory of the Linux kernel, which is, in most Linux distributions, /usr/src/linux.

Linux source code for all supported architectures is contained in about 8750 C and Assembly files stored in about 530 subdirectories; it consists of about 4 million lines of code, which occupy more than 144 megabytes of disk space.

The following list illustrates the directory tree containing the Linux source code. Please notice that only the subdirectories somehow related to the target of this book have been expanded.

D ire c t o ry

D e s c rip t io n

Documentation

Te xt file s wit h g e n e ra l e xp la n a t io n s a n d h in t s a b o u t ke rn e l co m p o n e n t s

arch

Pla t fo rm - d e p e n d e n t co d e

i3 8 6

IBM's PC a rch it e ct u re

ke rn e l

Ke rn e l co re

mm

Me m o ry m a n a g e m e n t

m a th-e m u

S o ft wa re e m u la t o r fo r flo a t in g - p o in t u n it

lib

Ha rd wa re - d e p e n d e n t u t ilit y fu n ct io n s

boot

Bo o t s t ra p p in g

co m p re s s e d

Co m p re s s e d ke rn e l h a n d lin g

t o o ls

Pro g ra m s t o b u ild co m p re s s e d ke rn e l im a g e

a lp h a

He wle t t - Pa cka rd 's Alp h a a rch it e ct u re

a rm

Arch it e ct u re s b a s e d o n ARM p ro ce s s o rs

cris

Axis Co m m u n ica t io n AB's Co d e Re d u ce d In s t ru ct io n S e t a rch it e ct u re u s e d b y t h in - s e rve rs

ia 6 4

Wo rks t a t io n s b a s e d o n In t e l's 6 4 - b it It a n iu m m icro p ro ce s s o r

m 68k

Mo t o ro la 's MC6 8 0 x0 - b a s e d a rch it e ct u re

m ip s

MIPS a rch it e ct u re a d o p t e d b y S ilico n Gra p h ics a n d o t h e r co m p u t e r m a n u fa ct u re rs

m ip s 6 4

6 4 - b it MIPS a rch it e ct u re

p a ris c

HP 9 0 0 0 p a ris c wo rks t a t io n s

ppc

Mo t o ro la - IBM's Po we rPC- b a s e d a rch it e ct u re s

s390

IBM's ES A/ 3 9 0 a n d 3 2 - b it zS e rie s a rch it e ct u re s

s390x

IBM's 6 4 - b it zS e rie s a rch it e ct u re s

sh

S u p e rH- b a s e d e m b e d d e d co m p u t e rs

s p a rc

S u n 's S PARC a rch it e ct u re

s p a rc6 4

S u n 's Ult ra - S PARC a rch it e ct u re

drivers

De vice d rive rs

a co rn

Aco rn 's d e vice s

a cp i

Ad va n ce d Co n fig u ra t io n Po we r In t e rfa ce ( a p o we r m a n a g e m e n t s t a n d a rd t h a t p ro vid e s m o re fe a t u re s t h a n APM)

a tm

S u p p o rt fo r ATM n e t wo rk a rch it e ct u re

b lo ck

Blo ck d e vice d rive rs

p a rid e

S u p p o rt fo r a cce s s in g IDE d e vice s fro m p a ra lle l p o rt

b lu e t o o t h

Drive rs fo r d e vice s co n n e ct e d t h ro u g h t h e Blu e t o o t h wire le s s p ro t o co l

cd ro m

Pro p rie t a ry CD- ROM d e vice s ( n e it h e r ATAPI n o r S CS I)

ch a r

Ch a ra ct e r d e vice d rive rs

agp

Drive rs fo r AGP vid e o ca rd s

d rm

Drive r t h a t s u p p o rt s t h e Xfre e 8 6 Dire ct Re n d e rin g In fra s t ru ct u re

d rm - 4 . 0

An o t h e r d rive r t h a t s u p p o rt s t h e Xfre e 8 6 Dire ct Re n d e rin g In fra s t ru ct u re

ft a p e

Ta p e - s t re a m in g d e vice s

ip 2

Co m p u t o n e In t e llip o rt II m u lt ip o rt s e ria l co n t ro lle rs

jo ys t ick

Jo ys t icks

m wa ve

IBM's Win m o d e m - like d rive r fo r Lin u x

p cm cia

Drive r fo r PCMCIA s e ria l d e vice

rio

Drive r fo r t h e S p e cia lix Rio m u lt ip o rt s e ria l ca rd

d io

He wle t t - Pa cka rd 's HP3 0 0 DIO b u s s u p p o rt

fc4

Fib re Ch a n n e l d e vice s

h o t p lu g

S u p p o rt fo r h o t p lu g g in g o f PCI d e vice s

i2 c

Drive r fo r Ph ilip s ' I2 C 2 - wire b u s

id e

Drive rs fo r IDE d is ks

ie e e 1 3 9 4

Drive r fo r IEEE1 3 9 4 h ig h - s p e e d s e ria l b u s

in p u t

In p u t la ye r m o d u le fo r jo ys t icks , ke yb o a rd s , a n d m o u s e s

is d n

IS DN d e vice s

m a cin t o s h

Ap p le 's Ma cin t o s h d e vice s

md

La ye r fo r "m u lt ip le d e vice s " ( d is k a rra ys a n d Lo g ica l Vo lu m e Ma n a g e r)

m e d ia

Drive rs fo r ra d io a n d vid e o d e vice s

m essage

Hig h p e rfo rm a n ce S CS I + LAN/ Fib re Ch a n n e l d rive rs

m is c

Mis ce lla n e o u s d e vice s

m td

S u p p o rt fo r Me m o ry Te ch n o lo g y De vice s ( e s p e cia lly fla s h d e vice s )

net

Ne t wo rk ca rd d e vice s

nubus

Ap p le 's Ma cin t o s h Nu b u s s u p p o rt

p a rp o rt

Pa ra lle l p o rt s u p p o rt

p ci

PCI b u s s u p p o rt

p cm cia

PCMCIA ca rd s u p p o rt

pnp

Plu g - a n d - p la y s u p p o rt

s390

IBM's ES A/ 3 9 0 a n d zS e rie s d e vice s u p p o rt

sbus

S u n 's S PARC S Bu s s u p p o rt

s cs i

S CS I d e vice d rive rs

sgi

S ilico n Gra p h ics ' d e vice s

sound

Au d io ca rd d e vice s

tc

He wle t t - Pa cka rd ( fo rm e rly DEC) TURBOCh a n n e l b u s s u p p o rt

t e le p h o n y

S u p p o rt fo r vo ice - o ve r- IP d e vice s

usb

Un ive rs a l S e ria l Bu s ( US B) s u p p o rt

vid e o

Vid e o ca rd d e vice s

zo rro

Am ig a 's Zo rro b u s s u p p o rt

fs

File s ys t e m s

a d fs

Aco rn Dis c Filin g S ys t e m

a ffs

Am ig a 's Fa s t File S ys t e m ( FFS )

a u t o fs

S u p p o rt fo r ke rn e l- b a s e d file s ys t e m a u t o m o u n t e r d a e m o n

a u t o fs 4

An o t h e r ve rs io n o f s u p p o rt fo r ke rn e l- b a s e d file s ys t e m a u t o m o u n t e r d a e m o n ( Ve rs io n 4 )

b fs

S CO Un ixWa re Bo o t File S ys t e m

co d a

Co d a n e t wo rk file s ys t e m

cra m fs

Da t a co m p re s s in g file s ys t e m fo r MTD d e vice s

d e vfs

De vice file s ys t e m

d e vp t s

Ps e u d o t e rm in a l s u p p o rt ( Op e n Gro u p 's Un ix9 8 s t a n d a rd )

e fs

S GI IRIX's EFS file s ys t e m

e xt 2

Lin u x n a t ive Ext 2 file s ys t e m

e xt 3

Lin u x n a t ive Ext 3 file s ys t e m

fa t

Co m m o n co d e fo r FAT- b a s e d file s ys t e m s

fre e vxfs

Ve rit a s VxFS file s ys t e m u s e d b y S CO Un ixWa re

h fs

Ap p le 's Ma cin t o s h file s ys t e m

h p fs

IBM's OS / 2 file s ys t e m

in fla t e _ fs

La ye r fo r d e co m p re s s in g file s in cra m fs a n d is o 9 6 6 0 file s ys t e m s

in t e rm e zzo

In t e rMe zzo h ig h - a va ila b ilit y d is t rib u t e d file s ys t e m

is o fs

IS O9 6 6 0 file s ys t e m ( CD- ROM)

jb d

Jo u rn a lin g file s ys t e m la ye r u s e d b y Ext 3

jffs

Jo u rn a lin g file s ys t e m s fo r MTD d e vice s

jffs 2

An o t h e r jo u rn a lin g file s ys t e m s fo r MTD d e vice s

lockd

Remote file locking support

minix

MINIX filesystem

msdos

Microsoft's MS-DOS filesystem

ncpfs

Novell's Netware Core Protocol (NCP)

nfs

Network File System (NFS)

nfsd

Integrated Network filesystem server

nls

Native Language Support

ntfs

Microsoft's Windows NT filesystem

openpromfs

Special filesystem for SPARC's OpenPROM device tree

partitions

Code for reading several disk partition formats

proc

/proc virtual filesystem

qnx4

Filesystem for QNX 4 OS

ramfs

Simple RAM filesystem

reiserfs

Reiser filesystem

romfs

Small read-only filesystem

smbfs

Microsoft's Windows Server Message Block (SMB) filesystem

sysv

System V, SCO, Xenix, Coherent, and Version 7 filesystem

udf

Universal Disk Format filesystem (DVD)

ufs

Unix BSD, SunOS, FreeBSD, OpenBSD, and NeXTStep filesystem

umsdos

UMSDOS filesystem

vfat

Microsoft's Windows filesystem (VFAT)

include

Header files (.h)

asm-generic

Platform-independent low-level header files

asm-i386

IBM's PC architecture

asm-xxx

Header files for other architectures

linux

Kernel core

byteorder

Byte-swapping functions

isdn

ISDN functions

lockd

Remote file locking

mtd

MTD devices

netfilter_ipv4

Filtering for TCP/IPv4

netfilter_ipv6

Filtering for TCP/IPv6

nfsd

Integrated Network File Server

raid

RAID disks

sunrpc

Sun's Remote Procedure Call

math-emu

Mathematical coprocessor emulation

net

Networking

pcmcia

PCMCIA support

scsi

SCSI support

video

Frame buffer support

init

Kernel initialization code

ipc

System V's Interprocess Communication

kernel

Kernel core: processes, timing, program execution, signals, modules, etc.

lib

General-purpose kernel functions

mm

Memory handling

net

A bunch of networking protocols

scripts

External programs for building the kernel image


Bibliography

This bibliography is broken down by subject area and lists some of the most common and, in our opinion, useful books and online documentation on the topic of kernels.


Books on Unix Kernels

● Bach, M. J. The Design of the Unix Operating System. Prentice Hall International, Inc., 1986. A classic book describing the SVR2 kernel.

● Goodheart, B. and J. Cox. The Magic Garden Explained: The Internals of the Unix System V Release 4. Prentice Hall International, Inc., 1994. An excellent book on the SVR4 kernel.

● McKusick, M. K., M. J. Karels, and K. Bostic. The Design and Implementation of the 4.4BSD Operating System. Addison Wesley, 1996. Perhaps the most authoritative book on the 4.4BSD kernel.

● Vahalia, U. Unix Internals: The New Frontiers. Prentice Hall, Inc., 1996. A valuable book that provides plenty of insight on modern Unix kernel design issues. It includes a rich bibliography.


Books on the Linux Kernel

● Beck, M., H. Boehme, M. Dziadzka, U. Kunitz, R. Magnus, C. Schroter, and D. Verworner. Linux Kernel Programming (3rd ed.). Addison Wesley, 2002. A hardware-independent book covering the Linux 2.4 kernel.

● Maxwell, S. Linux Core Kernel Commentary. The Coriolis Group, LLC, 1999. A listing of part of the Linux kernel source code with some interesting comments at the end of the book.

● Mosberger, D., S. Eranian, and B. Perens. IA-64 Linux Kernel: Design and Implementation. Prentice Hall, Inc., 2002. An excellent description of the hardware-dependent Linux kernel for the Itanium IA-64 microprocessor.

● Rubini, A. and J. Corbet. Linux Device Drivers (2nd ed.). O'Reilly & Associates, Inc., 2001. A valuable book that is somewhat complementary to this one. It gives plenty of information on how to develop drivers for Linux.

● Satchell, S. and H. Clifford. Linux IP Stacks Commentary. The Coriolis Group, LLC, 2000. A listing of part of the Linux kernel networking source code with some comments at the end of the book.


Books on PC Architecture and Technical Manuals on Intel Microprocessors

● Intel, Intel Architecture Software Developer's Manual, vol. 3: System Programming. 1999. Describes the Intel Pentium microprocessor architecture. It can be downloaded from http://developer.intel.com/design/pentiumii/manuals/24319202.pdf.

● Intel, MultiProcessor Specification, Version 1.4. 1997. Describes the Intel multiprocessor architecture specifications. It can be downloaded from http://www.intel.com/design/pentium/datashts/242016.htm.

● Messmer, H. P. The Indispensable PC Hardware Book (3rd ed.). Addison Wesley Longman Limited, 1997. A valuable reference that exhaustively describes the many components of a PC.


Other Online Documentation Sources

Linux source code

The official site for getting kernel source can be found at http://www.kernel.org. Many mirror sites are also available all over the world. A valuable search engine for the Linux 2.4 source code is available at http://www.tamacom.com/tour/linux/.

GCC manuals

All distributions of the GNU C compiler should include full documentation for all its features, stored in several info files that can be read with the Emacs program or an info reader. By the way, the information on Extended Inline Assembly is quite hard to follow, since it does not refer to any specific architecture. Some pertinent information about 80x86 GCC's Inline Assembly and gas, the GNU assembler invoked by GCC, can be found at:

http://www.delorie.com/djgpp/doc/brennan/brennan_att_inline_djgpp.html
http://www.ibm.com/developerworks/linux/library/l-ia.html
http://www.gnu.org/manual/gas-2.9.1/as.html

The Linux Documentation Project

The web site (http://www.tldp.org) contains the home page of the Linux Documentation Project, which, in turn, includes several interesting references to guides, FAQs, and HOWTOs.

Linux kernel development forum

The newsgroup comp.os.linux.development.system is dedicated to discussions about development of the Linux system.

The linux-kernel mailing list

This fascinating mailing list contains much noise as well as a few pertinent comments about the current development version of Linux and about the rationale for including or not including in the kernel some proposals for changes. It is a living laboratory of new ideas that are taking shape. The name of the mailing list is linux-kernel@vger.kernel.org.

The Linux Kernel online book

Authored by David A. Rusling, this 200-page book can be viewed at http://www.tldp.org/LDP/tlk/tlk.html, and describes some fundamental aspects of the Linux 2.0 kernel.

Linux Virtual File System

The page at http://www.atnf.csiro.au/~rgooch/linux/docs/vfs.txt is an introduction to the Linux Virtual File System. The author is Richard Gooch.


Colophon

Our look is the result of reader comments, our own experimentation, and feedback from distribution channels. Distinctive covers complement our distinctive approach to technical topics, breathing personality and life into potentially dry subjects.

Mary Brady was the production editor and copyeditor for Understanding the Linux Kernel, Second Edition. Ann Schirmer was the proofreader. Sarah Sherman and Claire Cloutier provided quality control. Judy Hoer and Genevieve d'Entremont provided production assistance. John Bickelhaupt wrote the index.

Edie Freedman designed the cover of this book, based on a series design by herself and Hanna Dyer. The cover image of a man with a bubble is a 19th-century engraving from the Dover Pictorial Archive. Emma Colby produced the cover layout with QuarkXPress 4.1 using Adobe's ITC Garamond font.

David Futato designed the interior layout. The chapter opening images are from the Dover Pictorial Archive, Marvels of the New West: A Vivid Portrayal of the Stupendous Marvels in the Vast Wonderland West of the Missouri River, by William Thayer (The Henry Bill Publishing Co., 1888), and The Pioneer History of America: A Popular Account of the Heroes and Adventures, by Augustus Lynch Mason, A.M. (The Jones Brothers Publishing Company, 1884). This book was converted to FrameMaker 5.5.6 with a format conversion tool created by Erik Ray, Jason McIntosh, Neil Walls, and Mike Sierra that uses Perl and XML technologies. The text font is Linotype Birka; the heading font is Adobe Myriad Condensed; and the code font is LucasFont's TheSans Mono Condensed. The illustrations that appear in the book were produced by Robert Romano and Jessamyn Read using Macromedia FreeHand 9 and Adobe Photoshop 6.

The online edition of this book was created by the Safari production group (John Chodacki, Becki Maisch, and Madeleine Newell) using a set of Frame-to-XML conversion and cleanup tools written and maintained by Erik Ray, Benn Salter, John Chodacki, and Jeff Liggett.














I l@ve Ru Bo a rd

L1-caches
L2-caches
lazy TLB mode
ld.so
LDTDs (Local Descriptor Table Descriptors)
LDTs (Local Descriptor Tables)
lease locks
Least Recently Used (LRU) lists [See LRU lists]
left children, red-black trees
lightweight processes: creation in Linux
linear address fields
linear address intervals: allocating; releasing; memory regions, scanning; Page Tables, updating
linear addresses: and noncontiguous memory areas
links
Linux: advantages; emulation of other operating systems; filesystems; Unix filesystem and; hardware dependency; kernel; kernel control paths, interleaving; kernel threading; lightweight processes, reliance on; memory barriers; paging; platforms; POSIX compliance; segmentation; segments used; source code [See source code]; timekeeping [See timekeeping architecture]; Unix kernel and; version numbering
Linux kernels [See kernels]
LinuxThreads library
local APICs: arbitration; interrupt handlers
Local Descriptor Table Descriptors (LDTDs)
Local Descriptor Tables (LDTs)
local interrupts, disabling
local TLB
locality principle
locking
locks, global kernel
log record
logical addresses
logical block number
login name
login sessions
loopback
low-level driver
low-level driver descriptor
LRU (Least Recently Used) lists: pages, moving across

magic structure
major faults
major numbers
mandatory file locks
maskable interrupts
masked signals
masking of deferrable functions
Master Boot Record (MBR)
master kernel Page Global Directory
master memory descriptor
mathematical coprocessors
MBR (Master Boot Record)
memory: initialization of data structures for; management; buddy system algorithm; page frames; permanent kernel mappings; swapping [See swapping]; temporary kernel mappings
memory addresses
memory addressing
memory alignment
memory allocation and demand paging
memory arbiters
memory area descriptors
memory area management: cache descriptors; caches; interface, slab allocator and buddy system algorithm; multiprocessor systems; noncontiguous areas [See noncontiguous memory area management]; object descriptors; aligning objects in memory; slab allocators; slab coloring; slab descriptors; slabs; allocating to caches; releasing from caches
memory barriers
memory descriptors: fields; of kernel threads; mmap_cache; read/write semaphores; red-black trees
memory fragmentation
Memory Management Unit (MMU)
memory mapping: creating; data structures; demand paging for; destroying; flushing dirty pages to disk
memory nodes
memory regions: access rights; assignment to processes; data structures; fields; flags; handling; finding a free interval; finding a region that overlaps an interval; finding the closest region to an address; inserting a region in the memory descriptor list; linear address intervals; merging; pages, relation to; system calls for creation, deletion
memory swapping [See swapping]
memory zones
message queues
metadata
microkernels
microprocessors, hyper-threaded
minor faults
minor numbers
mke2fs utility program
mkswap command
MMU (Memory Management Unit)
MMX instructions
modprobe program
modules: advantages; data structures and dependencies; exception tables; exporting of symbols; implementation; licenses; linking and unlinking; linking on demand; module objects; module usage counters; request_module function
mount points
mounted filesystem descriptors
multiprocessing
multiprocessor systems: caches and; interrupt disabling and; interrupt handling on; memory and; memory area management; nonpreemptive kernels and; processes, scheduling on; timekeeping architecture; initialization
multiprogramming
multithreaded applications
multiuser systems

N-way set associative caches
named pipes
neighbor caches
Net-4
network addresses
network filesystems
network interface
network interface cards (NICs)
network masks
networking: data structures; BSD sockets; destination caches; Forwarding Information Base (FIB); INET sockets; neighbor caches; NICs (network interface cards); routing caches; socket buffers; frames; IP (Internet Protocol layer); network architectures; Linux, supported by; network cards; receiving packets from; sending packets to; network layers; payload; programming for networks; protocols; sockets; initialization; static routing table; system calls related to; data link layer; network layers; transport layer; zones
NGPT (Next Generation Posix Threading Package)
NMI interrupts
nodes, red-black trees
noncontiguous memory area management: allocating noncontiguous area; descriptors; linear addresses; Page Fault exception handlers and; Page Faults and; releasing memory area
nonexclusive processes
nonmaskable interrupts
nonpreemptive kernels: multiprocessor systems and
nonpreemptive processes
NUMA (Non-Uniform Memory Access): nodes; nodes descriptors

object descriptors
object files
objects: caches, allocating in; multiprocessors; uniprocessors; caches, releasing from; multiprocessors; uniprocessors; general purpose
offsets, of logical addresses
old-style device files
older siblings
operating systems: execution modes
output registers
owners

PAEs (Physical Address Extensions)
Page Cache Disable (PCD) flag
page caches: address_space objects; data structures; direct I/O transfers, bypassing with; handling functions; I/O operations, usage by; page descriptor fields; page hash tables
page descriptor lists
page descriptors
Page Directories
Page Fault exceptions
Page Fault exception handlers: Copy On Write; demand paging; faulty addresses; inside address space, handling; outside address space, handling; noncontiguous memory area accesses, handling; process flow
Page Faults, noncontiguous memory areas and
page frame reclaiming algorithm: from the dentry cache; functions; from the inode cache; kswapd kernel threads; Least Recently Used (LRU) lists; pages, moving across; purpose
page frames: avoiding race conditions on; free page frame; high-memory, kernel mapping of; management; memory zones; page descriptors; processes, sharing among; request and release of; reserved
page I/O operation
page slots: defective slots; functions for allocation and release of
Page Tables: handling; for kernels; provisional; for processes; protection bits
pages: LRU lists, moving across; memory regions, relation to; swapping [See swapping]
paging: demand paging; in hardware; in Linux; vs. segmentation
paging units
parallel ports
parent filesystems
password
pathname lookup
pathnames
payload
PCD (Page Cache Disable) flag
PCI buses, memory mapping
PCMCIA interfaces
pending blocked signals
pending signal queues
pending signals
Pentium processors: caching; f00f bug; three-level paging and
periods in directory notation
permanent kernel mappings: temporary kernel mappings, contrasted with
personalities: Linux, supported by
Physical Address Extensions (PAEs)
physical addresses
physical pages
PID (process ID)
pipe buffer
pipe size
pipes: creating and destroying; data structures; FIFOs, contrasted with; limitations of; read and write channels; reading from; writing into
PIT (Programmable Interval Timer): interrupt service routine; multiprocessor systems and
polling mode
ports: I/O ports
POSIX (Portable Operating Systems based on Unix): signals
Power-On Self-Test (POST)
preemptive kernels
preemptive processes
prepare_write method
primitive semaphores
priority inversion
private memory mapping
process 0
process 1
process capabilities
process credentials
process descriptors: hardware context, saving of; memory, storage in; process 0
process descriptor pointers
process lists: doubly linked lists; representation
process group ID
process groups
process ID (PID)
process page tables
process switches: hardware context; interrupt handling, contrasted with; kernels, performance by
process time-outs
process/kernel model
processes: address spaces; creating; deleting; functions and macros for accessing; linear addresses; children; communication between [See interprocess communication]; creating; destroying; execution domains, specification; files associated with; files, reading from; I/O-bound or CPU-bound; implementation; init; kernel stack representation; lightweight processes; creation in Linux; management; memory regions, assignment circumstances; memory requests; original parents; page frames, sharing of; parents; personality fields; preemption of; program execution [See program execution]; quantum duration; removal; resource limits; scheduling algorithm [See scheduling algorithm]; base priority; base time quanta; data structures; dynamic priority; epochs; evaluating priority; multiprocessor systems; policy; priority, assignment of; real-time processes, system calls related to; schedule function; static priority; system calls related to; signals, response to; sleeping processes; suspending; swapper; termination; time sharing; types of; younger siblings; zombies
processor-detected exceptions
profile
program counters
program execution: command-line arguments; environment variables; exec functions; executable files; executable formats; execution domains; libraries; process capabilities; process memory regions; segments
program interpreters
Programmable Interval Timer (PIT)
programmed exceptions
protected mode
protocols
provisional Page Global Directory
pthread (POSIX thread) libraries
PWT (Page Write-Through) cache

quanta
quantum duration

race conditions: dynamic timers and; prevention; swap caches and
RAM (random access memory): dynamic memory and; swapping [See swapping]; Unix, usage in
random access memory [See RAM]
Read access rights
read file locks
read operation descriptors
read-ahead algorithm
read-ahead groups
read-ahead of files: asynchronous read-ahead operations; synchronous read-ahead operations
read-ahead windows
read/write semaphores
read/write spin locks
real mode
Real Mode addresses
Real Time Clock (RTC)
real-time processes: system calls related to
real-time signals
red-black trees
reentrant functions
reentrant kernels: synchronization; interrupt disabling
reference counters
registering a device driver
regular files
regular signals
relative pathnames
request descriptors
request queues
Requestor Privilege Level
reserved page frames
resource
resource limits
right children, red-black trees
root
root directories
root filesystems: mounting
routers
routing caches
routing data structures
routing zone
run queues

S system flags
schedule function: direct invocation; lazy invocation; process switches; actions performed after; actions performed before
scheduler
scheduling algorithm: I/O-bound process boosting, effectiveness of; performance of; real-time application support; scaling; system load and
scheduling policy
SCSIs (Small Computer System Interfaces): bus; standard
Second Extended Filesystem [See Ext2]
sectors
Segment Descriptors
Segment Selectors
segmentation: in Linux; vs. paging
segmentation registers
segmentation units
segments: CPL (Current Privilege Level) and; Linux, used in; of logical addresses
self-caching applications
semaphores: acquiring; kernel semaphores; race conditions, preventing with; read/write semaphores; releasing; System V IPC
serial ports
Set Group ID (sgid)
Set User ID (suid)
setuid programs
sgid (Set Group ID)
share-mode mandatory locks
shared libraries
shared linked lists, insertion of elements into
shared memory
shared memory mapping
shared page swapping
signals: blocked signals, modifying of; blocking of; catching; frames, setting up; signal flags, evaluating; signal handlers, starting and terminating; changing the action of; data structures; operations on; default actions; delivering; descriptors; exception handlers; executing default actions for; forcing functions; generating; ignoring; Linux 2.4, first 31 in; masking of; pending blocked signals, examining; pending signal queues; pending signals; phases of transmission; process descriptor fields for handling; processes, response of; processes, suspending; purpose; real-time signals; real-time signals, system calls for; regular signals; sender codes; sending functions; SIG prefix; SIGKILL; SIGSTOP; system calls for handling of; reexecuting
slab allocators: buddy system algorithm, interfacing with
slab cache list semaphores
slab caches
slab coloring
slab descriptors
slab object constructors
slab object destructors
slab objects
slabs: caches, allocating to; releasing from caches
sleeping processes
slices
slot indexes
slot usage sequence numbers
slots
Small Computer System Interfaces [See SCSIs]
SMP (symmetric multiprocessing): systems, timekeeping in
socket buffers
socket control messages
sockets: initialization
softirqs: tasklets, contrasted with
software interrupts
software timers
source code: GNU/Linux kernel vs. commercial distributions
source code and instruction order
source code directory tree
special filesystems
specific caches
spin locks: global kernel locks
SSE extensions (Streaming SIMD Extensions)
stack segment registers
stack segments
static distribution of IRQs
static libraries
static priority
static routing table
static timers
status registers
sticky flag
strategy routine
suid (Set User ID)
superblock
superblock objects
superblock operations
superformat utility program
superuser
supervisor
swap areas: activation; activation service routine; deactivation service routine; descriptors; format; multiple areas, advantages; page slots; allocating and releasing; prioritization; swap-in and updating function
swap caches: helper functions
swapoff program
swapon program
swapped-out page identifiers: page table value entries
swapper: time sharing and
swapping: drawbacks; pages; choosing; distribution; functions for; Least Recently Used (LRU) algorithms; page frame reclaiming [See page frame reclaiming]; selection; swapping in; swapping out; timing of; transferring; of process address space; purpose; share page swapping; swap areas [See swap areas]
symbolic links
symmetric multiprocessing [See SMP]
synchronization primitives: atomic operations; choosing among, considerations; completions; kernel data structures, access using; memory barriers; semaphores; spin locks
synchronous errors or exceptions
synchronous interrupts
synchronous read-ahead operations
system administrators
system bootstrap: I/O APIC initialization
system call dispatch tables
system call numbers
system call service routines
system calls: dynamic address checking; file-handling; handler and service routines; initializing; kernel wrapper routines; networking, related to; parameters; passing; verifying; POSIX APIs and; process address spaces, accessing; process scheduling, for; processes, suspending; real-time processes, related to; reexecuting; signals; changing action of; system call function; timing measurements, related to; Virtual Filesystem, handled by
system concurrency level
system gates
system load
System Segment
system startup: BIOS; boot loader; Linux boot; from floppy disk; from hard disk
system statistics, updating by kernel
System V IPC (Interprocess Communication) [See also interprocess communications]: IPC identifiers; IPC keys; IPC resources; IPC shared memory region; messages; semaphores; shared memory; data structures; demand paging; page swapping; shm filesystem; system calls

Table Indicator
task gates
task priority registers
task queues
Task State Segment Descriptors (TSSDs)
Task State Segments (TSSs)
task switch
tasklet descriptors
tasklets: softirqs, contrasted with
tasks
TCP/IP network architecture
temporary kernel mappings: permanent kernel mappings, contrasted with
text segments
thread groups
threads
three-level paging: Pentium processors and
ticks
time multiplexing
time quantum
time sharing: in the CPU
Time Stamp Counter
time-outs
time-sharing
timekeeping: system calls related to; time and date updates
timekeeping architecture: initialization, multiprocessor systems; multiprocessor systems; uniprocessor systems
timer interrupts
timers: real-time applications and
timing measurements: via hardware; types
TLBs (Translation Lookaside Buffers): handling
top halves
Torvalds, Linus
transactions
Translation Lookaside Buffers [See TLBs]
trap gates
traps
TSC (Time Stamp Counter)
two-level paging

UID (User ID)
umask
uniprocessor systems: timekeeping architecture
uninitialized data segments
universal serial buses (USBs)
Unix: process management; System V interprocess communication
Unix filesystem: access rights; directory structure; file-handling system calls; files; file types
Unix kernels [See under kernels]
Unix operating systems: Linux and
USBs (universal serial buses)
user code segment
user data segment
user group
User ID (UID)
User Mode: exceptions in; Kernel Mode, contrasted with; memory, allocation to; processes, synchronization of [See interprocess communications]; system calls [See system calls]
user threads

vectors
VESA Local Buses (VLBs), memory mapping
virtual address space
virtual addresses
virtual block devices
Virtual Filesystem (VFS): common file model; data structures; dentry objects; file objects; inode objects; processes; superblock objects; description; device files, handling of; Ext2 superblock data; file locking; filesystems types; objects; pathname lookup; superblock operations; supported filesystems; system calls implementation
virtual filesystems
virtual memory

wait queue head
wait queues
watchdog system
windows, kernel address space
wrapper routines [See also kernel wrapper routines]
Write access rights
write file locks

XMM registers

zero page
zombie processes
zone modifier
zones: size and addressing


Understanding the Linux Kernel, 2nd Edition


By Daniel P. Bovet, Marco Cesati 2nd Edition December 2002 0-596-00213-0, Order Number: 2130 784 pages, $49.95 US, $77.95 CA, £35.50 UK


Reviews

Reviews From Previous Edition

"Would I buy the book? Undoubtedly, although I don't need it. However, if you need to understand Linux source code, then this is the essential guide." --Jan Wysocki, news@UK, June 2001

"Despite the lucid and knowledgeable writing, you'll come up against some brain-stretching complexity. Nevertheless, this book is an important addition to the Linux canon." --Steve Patient, Amazon.co.uk

"Fortunately, times have changed, and now there are several good overviews of the Linux Kernel. Perhaps the most lucid is 'Understanding the Linux Kernel'." --John Lombardo, Embedded Linux Journal, June 2001

"Online documentation is prolific, but tends to be terse. Fortunately, a growing body of literature is developing, a prime example of which is O'Reilly's 'Understanding the Linux Kernel'. Readers will find much of interest in the well-written text." --Major Kearny, Book News, April 2001

"...covers a difficult-to-grasp and technical subject matter, but does it clearly and concisely...a solid grounding in the operation of the Linux Kernel. Rating 9/10." --Richard Drummond, Linux Format, March 2001

"So, taking it as a given that a book about Linux internals is a good thing, how good is this one? Happily, it's very good --better than any previous such book that I've seen. This is a good book. The authors have cracked open a large collection of code that's currently very relevant. If they are in for the long haul and release revised books in a timely way, then this will likely become and remain the definitive explanation of Linux internals." --John Regehr, slashdot.org, January 23, 2001


"O'Reilly continues its tradition of exhaustive and thoroughly lucid guides to all things technical with this thick guided tour of the Linux Kernel. What makes this book stand out among other guides to the Linux operating system is that it takes the time to explain why certain features of the kernel are good or bad for specific applications. It's only a matter of time before this becomes a textbook for advanced college course on operating systems. Highly recommended for serious programmers and application developers." --Netsurfer Digest, Dec 6, 2000


"An outstanding explanation of the kernel that should benefit almost any C/C++ programmer working on Linux. Any programmer who has jumped into the kernel knows there is a real need for a book that takes a reader by the hand and steps through all the major (and sometime minor) internal components and processes of the Linux kernel. Luckily "Understanding the Linux Kernel" not only does that, but it does it very well...the presentation of the material is very well executed, even by O'Reilly's normally high standards...a must-read for anyone doing non-trivial programming on Linux." --Lou Grinzo, internet.com, Dec 22, 2000

"a practical introduction to kernel internals for those who are new to the subject, and I strongly recommend it for any programmer who's competent in C." --www.kuro5hin.org, Feb 14th, 2001

"If you have reached the point where you have learned a few simple ideas about programming in Linux and you would like to know more about kernels then this book is probably for you." --Richard Ibbotson, Sheffield Linux User's Group, Feb 2001

"I am impressed both by the depth of coverage and by the readability of the text, especially bearing in mind the somewhat geek-like nature of the subject that's being discussed. Is the best explanation of Linux kernel internals that I've seen so far. This one's sure to be a classic, buy it if you can." --Developers Review, Feb 2001

"After reading this book, you should be able to find your way through the code, distinguishing between crucial data structures and secondary ones"--in short, you'll become a true Linux hacker." --Software World, January 2001
